TOC 0. Objectives 1. Introduction 2. Alternative Splicing Data Analysis 3. Single-cell RNA-Seq Data Analysis 1) Clustering of Cells Not Using Marker Genes (Only SC RNA-Seq Data) 2) Single-cell DEG Analysis 3) Reconstruction of Single-Cell Trajectory 4. RNA Editing
It’s possible to perform a detailed data analysis at nucleotide and single variant level with RNA-Seq. RNA-Seq is now being widely applied to not only transcript expression profile analysis, but also in miRNA-Seq and miRNA DEG, alternative splicing type calling, lncRNA (long non-coding RNA) analysis, fusion-gene detection (SV), RNA editing and single-cell transcriptome analysis, etc. In this chapter we’re learning how to perform alternative splicing analysis with rMATS, how to do scRNA-Seq transcriptome analysis via single-cell genomic amplification technology, and RNA editing data analysis with RCARE.
With the ever-accelerating pace of developments in NGS technology, more and more studies are being accmulated in transcriptome data research at base (nucleotide)-level. The most prominent area of transcriptome research are RNA-Seq and miRNA-Seq, but there are also other important areas of application like splicing analysis, lncRNA (long non-coding RNA), fusion gene detection, RNA editing detection, and single-cell RNA-seq analysis, etc. Apparently lncRNA plays an important role in development and maintenance, and the overall physiology of cancer according to this article. They are involved in chromatin remodeling, as well as transcriptional and post-transcriptional regulation through a variety of chromatin-based mechanisms and via cross-talk with other RNA species. They can act as decoys, scaffolds, and enhancer RNAs. As much as 85% of the human genome is transcribed, but only a small portion of them are protein coding transcripts. Non coding RNAs (ncRNAs, henceforth) can be roughly classified into two groups: short RNAs less than 200 nucleotides(nt) in length like miRNAs (microRNAs: 21-24nt in length) and piwi-interacting RNAs (piRNAs). On the other hand, long ncRNAs (lncRNAs) also exist, with length of around 200 nt or more. While miRNAs have been studied heavily, lncRNAs are less understood. Dysregulation of lncRNAs has been noted in glioblastoma, breastcancer, and leukemia. Such dysregulation commonly exerts impacts on cellular functions like cell proliferation, resistance to apoptosis, induction of angiogenesis, promotion of metastasis, and evasion of tumor suppressors. An emerging view on lncRNAs is that they are fundamental transcriptional regulators. It’s been quite an issue as to how its molecular intricacies work and several models of mechanism have been proposed, such as functioning as signal, decoy, scaffold, guide, enhancer RNAs, and short peptides. lncRNA can signal certain transcriptional activities (a hint?). Also it works as a decoy to provide non-functional binding sites
In DEG methods, we use DEGSeq (Wang et al., 2010), DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), baySeq (Hardcastle and Kell, 2010). In alternative splicing analysis we use MISO (Katz et al., 2010), rMATS(Shen et al., 2014), SUPPA (Alamancos et al., 2014), etc. miRNA-seq technology is for analyzing miRNA transcripts, which is part of non-coding RNAs. Some non-coding RNAs serve gene expression regulatory functions as well as protecting and preserving self-DNAs aganist invasive DNA that originate from external sources and assist in DNA synthesis and stability. miRNAs are also being investigated and applied in cancer diagnoses and prognoses prediction (Hayes et al., 2014: Thomas et al., 2015: Khan et al., 2015). You can also find cancer-specifically expressed miRNAs by DEG analysis via edgeR and DESeq2 (Love et al., 2014).
Single-cell sequencing is an optimized NGS technology that provides insights into the functions of each individual cell in the rich context of microenvironment that surrounds such cells. scRNA-seq (single-cell RNA sequencing) can provide us with the expression profile of individual cells. It is unrealistic to expect to obtain information about every RNA in the cell, but certainly the expression pattern is there and it can be analyzed through gene cluster analyses.
This practice covers the basics and real-data case study of alternative splicing data analysis, miRNA DEG analysis, single-cell tarnscriptome analysis, and RNA editing data analysis.
Question: How to discern specifically miRNAs or lncRNAs from a pool of transcriptome? Answer: Easy! Length.
Alternative splicing is a phenomenon found in eukaryotic cells. It’s a way of getting more genetic diversity from the limited information space of genomic DNA. It’s known that there are about 22,000 human genes and alternative splicing provides a leeway into producing more diverse proteins. However, base-level analysis of transcript was difficult before the advent of RNA-seq. Albeit, there have been attempts to use microarrays with exon-junction regions as probes and tiling-array technology for transcript analyses although not it proved not sufficient. To be honest, RNA-seq based alternative splicing analysis was also unstable at the beginning, but nowadays there are various anlytical tools and analytic methods introduced to the field.
Percent Spliced In (PSI) value is a quantitative value that tells us how many of exons and introns of interest (if these sequences are included in the reads, those reads are called as “spliced-in’s”) are detected within the transcriptome data. Basically, it’s number of reads that are supportive of containing a certain isoform divided by the sum of reads containing that isoform and reads that are exclusive of that isoform.
\[\textrm{PSI (percent spliced-in)} = \frac{\textrm{Short reads that contain a certain isoforms}}{\textrm{(Short reads w/ isoforms) + (Short reads w.o. isoforms)}} \cdot 100 \%\]
We use rMATS, which is the alternative splicing analysis tool, to produce our own PSI value from RNA-seq data. (Then we would need 1. the exons and introns of interest, and 2. transcriptome data from RNA-seq, 3. quite possibly annotation data?)
What I imagine the whole process is going to be is like this: RNA-seq raw reads \(\rightarrow\) aligned to reference (bam) and assembled to transcripts \(\rightarrow\) annotated (gtf) \(\rightarrow\) rMATS (splicing analysis!) \(\rightarrow\) PSI value.
It actually goes like this:
RNA-seq fastq file \(\rightarrow\) Read alignment \(\rightarrow\) RNA-seq bam file \(\rightarrow\) Detection of alternative splicing events with rMATS \(\rightarrow\) Visualization of alternative splicing regions
Now look at the bam file data that’s already been aligned, check out its contents and characteristics with the following command.
(base) soh1@soh1:~/GDA/rMATS/testData$ samtools view 231ESRP.25K.rep-1.bam | less
The data looks like this:
HWI-ST568_0055:2:1101:14836:3706#TGACCA 353 chr1 17941 1 50M = 18296 405 GGTGCTTGCTCTGGATCCTGTGGCGGGGGGGGCTCTGCAGGCCAGGGGCC ggggggggggggggggggggggggggegdBBBBBBBBBBBBBBBBBBBBB NM:i:3 NH:i:4 CC:Z:chr15 CP:i:102513175 HI:i:0
Let’s review sam format. It’s got read information (QNAME: query name), bitwise flag (FLAG: Information about pairing, strand, mate strand, etc. For example, 353 above would be translated to 1+32+64+256 and so 0x1, 0x20, 0x40, 0x100, which is 101100001 in binary.1), chromosome (RNAME: reference sequence name), position (POS: 1-based leftmost position of clipped alignment), mapping quality by phred-scale (MAPQ), CIGAR string (CIGAR: Extended CIGAR string (operations:MIDNSHP)), mate reference name (MRNM, “=” if same as RNAME), 1-based leftmost mate position (MPOS), inferred insert size (ISIZE), sequence on the same strand as the reference (SEQQuery), and query quality score (QUAL: ASCII-33=Phred base quality). Refer to this article in biostars for further information.
Now, use the BAM file as input to rMATS and find out how much alternative splicing occurred according to each type. rMATS computes PSI value for each gene between two comparison groups and expressed the significance of difference between the two PSI values with p-value. This process takes about 10 minutes.
Usage of options of rMATS are like this:
'single', paired-end 'paired'python2 ../rMATS.3.2.2.beta/RNASeq-MATS.py -b1 ./231ESRP.25K.rep-1.bam,./231ESRP.25K.rep-2.bam -b2 ./231EV.25K.rep-1.bam,./231EV.25K.rep
-2.bam -gtf ../rMATS.3.2.2.beta/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -o bam_test -t paired -len 50 -c 0.0001 -analysis U -libType fr-firststrand
cd bam_test
less 'log.RNASeq-MATS.2020-08-13 13:29:08.771400.txt'
If I look into 'log.RNASeq-MATS.2020-08-13 13:29:08.771400.txt', then it’s that way.
2020-08-13 13:29:08,781 ################### folder names and associated input files #############
2020-08-13 13:29:08,782 SAMPLE_1\REP_1 ./231ESRP.25K.rep-1.bam
2020-08-13 13:29:08,782 SAMPLE_1\REP_2 ./231ESRP.25K.rep-2.bam
2020-08-13 13:29:08,782 SAMPLE_2\REP_1 ./231EV.25K.rep-1.bam
2020-08-13 13:29:08,782 SAMPLE_2\REP_2 ./231EV.25K.rep-2.bam
2020-08-13 13:29:08,782 #########################################################################
2020-08-13 13:29:08,782 start mapping..
2020-08-13 13:29:08,782 bam files are provided. skip mapping..
2020-08-13 13:29:08,782 done mapping..
2020-08-13 13:29:08,782 getting uniquely mapped reads or pairs
2020-08-13 13:29:08,782 getting unique SAM function..
2020-08-13 13:29:08,784 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,784 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,784 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,784 Retry up to 3 more times..
2020-08-13 13:29:08,784 Retry: 1
2020-08-13 13:29:08,786 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,786 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,786 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,786 Retry: 2
2020-08-13 13:29:08,788 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,788 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,788 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,788 Retry: 3
2020-08-13 13:29:08,790 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,790 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,790 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,790 There is an exception in getting unique SAM files
2020-08-13 13:29:08,790 Exception: <type 'exceptions.Exception'>
2020-08-13 13:29:08,790 Detail:
The error says it all. function “and” never defined in line 2.
Apparently there have been issue reported about this: link1, link2
When I look for the awk version with awk -W version, it looks like this
(base) soh1@ssoh1:~/GDA/rMATS/rMATS.3.2.2.beta$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
According to this stackoverflow thread and this manual for mawk, mawk is a minimal-featured awk designed for speed of execution over functionality, written by Mike Brennan. Therefore it may behave differently from gawk or a POSIX awk. Although its man page claims that it’s POSIX compliant, it seems that it’s really not POSIX-compliant in practice.
So I think I would have to install gawk with sudo apt-get install gawk command to properly work with the code.
I check for awk, and type in awk --version, which should work with gawk, the output is like this:
(base) soh1@soh1:~/GDA/rMATS/rMATS.3.2.2.beta$ awk --version
awk: not an option: --version
So what’s awk? awk stands for Aho + Weinberger + Kernighan (Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan). awk’s purpose is to select record from a file, pull that record of particular pattern, and manipulate or datarize the selected portion of record values. For example, you can perform various stuff with awk like: - printing out the entire content of a text file, - print only specific fields of a file, - add or insert a string of interest into specific fields that satisfy a filtering condition and print those fields, - print records that contain a pattern - print the result of certain operation (calculation) on a specific field(s) - print differentially according to how field values compare.
For in-depth learning experience of awk, visit this blog called Recipes for Developer.
So now I install gawk.
(base) soh1@soh1:~/GDA/rMATS/rMATS.3.2.2.beta$ sudo apt-get install gawk
[sudo] soh1의 암호:
패키지 목록을 읽는 중입니다... 완료
의존성 트리를 만드는 중입니다
상태 정보를 읽는 중입니다... 완료
제안하는 패키지:
gawk-doc
다음 새 패키지를 설치할 것입니다:
gawk
0개 업그레이드, 1개 새로 설치, 0개 제거 및 0개 업그레이드 안 함.
418 k바이트 아카이브를 받아야 합니다.
이 작업 후 1,708 k바이트의 디스크 공간을 더 사용하게 됩니다.
받기:1 http://kr.archive.ubuntu.com/ubuntu focal/main amd64 gawk amd64 1:5.0.1+dfsg-1 [418 kB]
내려받기 418 k바이트, 소요시간 2초 (169 k바이트/초)
Selecting previously unselected package gawk.
(데이터베이스 읽는중 ...현재 368380개의 파일과 디렉터리가 설치되어 있습니다.)
Preparing to unpack .../gawk_1%3a5.0.1+dfsg-1_amd64.deb ...
Unpacking gawk (1:5.0.1+dfsg-1) ...
gawk (1:5.0.1+dfsg-1) 설정하는 중입니다 ...
Processing triggers for man-db (2.9.1-1) ...
Now if I testrun the code echo | awk'{print systime()}', it works fine…..? Not yet.
Then I run RNASeq-MATS.py again.
Fuck it. Whatever. I’m running it on server.
Instead, I used a conda environment specifically for rMATS.
There were warnings about RNASeq-MATS.py havnig error of indexing in line 214, which I looked into and found out that a variable called output wasn’t being split correctly. I looked up output and discovered that it was referring to lines,
myCmd = 'samtools view '+fq+' | head -n 1';
status,output=commands.getstatusoutput(myCmd);
and this meant that it’s gotta do something with samtools not being installed.
When I installed samtools, it still showed warnings and samtools was giving me errors in this issue log.
This was about openssl version compatibility as mentioned below, so I removed samtools and reinstalled along with downgraded openssl.
To summarize, this server has…:
conda install python=2.7.15)conda install numpy)conda install scipy)libcrypto.so.1.1.1 being referred to, which is apparently incompatible with samtools 1.9. So I installed libcrypto.so.1.0.0 that is included in openssl 1.0.0. (conda install c- bioconda samtools openssl=1.0)This solved the issue for me.
python ../rMATS.3.2.2.beta/RNASeq-MATS.py -b1 ./231ESRP.25K.rep-1.bam,./231ESRP.25K.rep-2.bam -b2 ./231EV.25K.rep-1.bam,./231EV.25K.rep
-2.bam -gtf ../rMATS.3.2.2.beta/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -o bam_test -t paired -len 50 -c 0.0001 -analysis U -libType fr-firststrand
It gives me report (less summary.txt in bam_test) like this:
################### folder names and associated input files #############
SAMPLE_1\REP_1 testData/231ESRP.25K.rep-1.bam
SAMPLE_1\REP_2 testData/231ESRP.25K.rep-2.bam
SAMPLE_2\REP_1 testData/231EV.25K.rep-1.bam
SAMPLE_2\REP_2 testData/231EV.25K.rep-2.bam
#########################################################################
############################# MATS Report #############################
======================================================================
EventType NumEvents.JC.only SigEvents.JC.only NumEvents.JC+readsOnTarget SigEvents.JC+readsOnTarget
======================================================================
SE 35 1 (1:0) 49 1 (1:0)
MXE 14 0 (0:0) 22 0 (0:0)
A5SS 7 0 (0:0) 8 0 (0:0)
A3SS 8 0 (0:0) 10 0 (0:0)
RI 13 0 (0:0) 25 0 (0:0)
======================================================================
################# Report Legend ################
EventType: Type of AS event
SE: Skipped exon
MXE: Mutually exclusive exon
A5SS: Alternative 5' splice site
A3SS: Alternative 3' splice site
RI: Retained intron
NumEvents.JC.only: total number of events detected using Junction Counts only
SigEvents.JC.only: number of significant events detected using Junction Counts only
The numbers in the parentheses (n1:n2) indicate the number of significant events that have higher inclusion level for SAMPLE_1 (n1) or for SAMPLE_2 (n2)
NumEvents.JC+readsOnTarget: total number of events detected using both Junction Counts and reads on target
SigEvents.JC+readsOnTarget: number of significant events detected using both Junction Counts and reads on target
The numbers in the parentheses (n1:n2) indicate the number of significant events that have higher inclusion level for SAMPLE_1 (n1) or for SAMPLE_2 (n2)
################################################
If you want more detailed info, read SE.MATS.ReadsOnTargetAndJunctionCounts.txt in bam_test.
It contains information like this:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IC_SAMPLE_1 SC_SAMPLE_1 IC_SAMPLE_2 SC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference
6461 "ENSG00000122566" "HNRNPA2B1" chr7 - 26237450 26237486 26237241 26237352 26240191 26240366 6461 14,13 1,0 0,0 13,13 85 49 5.977884853791693e-12 2.92916357836e-10 0.89,1.0 0.0,0.0 0.945
Giving us alternative splicing user-defined ID, gene ID, geneSymbol of spliced transcript, chromosome of splicing event, strand of splicing event, skipped exon start position, skipped exon end position, skipped exon upstream exon start, skipped exon upstream exon end, skipped exon downstream exon start, skipped exon downstream exon end, user defined ID yet again, inclusion counts for SAMPLE_1 (w/ replicates separated by comma), skipping counts for SAMPLE_1 (w/ replicates separated by comma), same for SAMPLE_2, same for SAMPLE_2, length of inclusion form (used for normalization), length of skipping form (used for normalization), p-value for differential splicing, FDR for differential splicing, inclusion level for SAMPLE_1 replicates from normalized counts, inclusion level for SAMPLE_2 replicates from normalized counts, average(IncLevel1) - average(IncLevel2).
Now, let’s draw the graph of exons that showed the most statistically significant alternative splicing from SE.MATS.ReadsOnTargetAndJunctionCounts.txt. This process takes about 5 min.
This requires rmats2sashimiplot package, so I installed it via conda install -c bioconda rmats2sashimiplot.
Apparently you can also visualize sashimi plots in IGV, so I read the following definition of sashimi plot: link. Or maybe the original paper publication on sashimi plot is here and here and here. Actually, from the Broad Institute homepage I found this link that officially documents about sashimiplot and its containing library MISO.
What is sashimi_plot?
Sashimi plots visualize splice junctions for multiple samples from their alignment data alongside genomic coordinates and a user-specified annotation track. It’s a utility for automatically producing publication-quality plots Sashimi plots for RNA-Seq analyses of isoform expression. It’s part of the MISO framework. In particular, it can 1) plot raw RNA-Seq densities along exons and junctions for multiple samples, while simultaneously visualizing the gene model/isoforms to which reads map, and 2) plot MISO output alongside the raw data or separately.
The genomic reads are converted into read densities (per-base expression on y-axis: RPKM) and it’s expressed as width in the plot. And junction reads are plotted as arcs whose width is proportional to the number of reads aligned to the junction spanning the exons connected by arc. (The reads that overlap over the exons)
So let’s get to it and plot our own sashimi plot.
cd testData
rmats2sashimiplot -b1 231ESRP.25K.rep-1.bam,231ESRP.25K.rep-2.bam -b2 231EV.25K.rep-1.bam,231EV.25K.rep-2.bam -t SE -e ~/GDA/rMATS/bam_test/MATS_output/SE.MATS.ReadsOnTargetAndJunctionCounts.txt -l1 ESRP -l2 EV -o test_events_output
Above options were wrong.
It should be changed to -b1 \(\rightarrow\) –b1, -b2 \(\rightarrow\) –b2, -l1 \(\rightarrow\) –l1, -l2 \(\rightarrow\) –l2.
Therefore I executed the following command again.
rmats2sashimiplot --b1 231ESRP.25K.rep-1.bam,231ESRP.25K.rep-2.bam --b2 231EV.25K.rep-1.bam,231EV.25K.rep-2.bam -t SE -e ~/GDA/rMATS/bam_test/MATS_output/SE.MATS.ReadsOnTargetAndJunctionCounts.txt --l1 ESRP --l2 EV -o test_events_output
Go into the output directory test_events_output/Sashimi_plot and see.
The red and orange plots’ width in the resulting figure means ESRP and EV transcipts’ read sequencing depths. If you compare the two isoforms in the bottom, ESRP seems to have an exon inclusion (one exon is included: IE), and EV seems to have an exon skipping (that particular exon is skipped entirely: SE) isoform. Black boxes denote exons and line with arrowheads denote introns. You can also note that although reads that occur in the splice junctions sometimes are not detected and arcs not showing, it doesn’t mean such combination of exons doesn’t exist.
Refer to this excellent report on the current state of Single-cell RNA sequencing tech.
Single RNA-Seq data analysis can be applied to tackle these problems:
Distinguish heterogeneity within a sample (heterogeneous sample)
Rare cell types (those that are hard to detect)
Cell evolution & Phylogenetics
Somatic mosaicism
Analysis of unculturable micro-organisms
Disease analysis
The whole experimental procedure goes like this:
Single-cell separation
Nucleic Acid Prep
Amplification
Library Prep
Sequencing and Bioinformatic Analysis
Physically separte different types of cells (perhaps by FACS?) and put in different barcode sequences with adapters. FPM normalization usually.
Single cell sequencing data contains at the least 20000~30000 gene expression information and therefore is high dimensional data. Takes a long time to actually take all these variables into account to perform computation would be a slow and painful process, might be vulnerable to noises, and contains too much of useless information. Therefore we remove statistically insignificant genes from the list (about 2000~3000 genes remaining). Additional dimension reduction techniques like PCA (principal component analysis) or CCA (canonical correlation analysis) are applied to group genes that show similar patterns of variability. The primary purpose of this step is to maximally exclude noises while retaining the biologically relevant and crucial variables.
After conversion, the data are coordinates in the low dimension and unique coordinates are assigned to each and every cell. Similar cells would have similar coordinates and thus clustered among themselves, forming a geometrical morphology. We need various geometrical approaches for understanding cell clustering. The most representative of them is a graph. A collection of point and lines form a graph and it’s very useful for simplifying complex morphology for analysis. K-nearest neighbor graph, which is the graph that connects k-number of dots (nodes) that are closes, is the most used form in single-cell RNA sequencing.
Data converted to graph is applied to various analyses. Cells that belong to a group form a well-connected cluster, and using algorithms that cut out such clusters are there for easy classification. Louvain clustering is used here. Discovering marker genes, classifying them into different types, and analyzing the features of each type (characteristics) is the primary process in single-cell sequencing.
Clustering is cutting out a particular portion of cells and classifying them, and at the opposite end is the approach where one studies the connections within the clusters. Cells in development or differentiation undergo continuous change in their expression profiles, and therefore the connection looks like a smooth cylindrical shape once expressed as a graph. Simplifying this connectivity structure, finding the central axis, and allocating the cells along the axis allows us to see the cell’s change in time. You could call it “time projected onto space”, and therefore is called “pseudo-time”. There are already many pseudo-time analysis methods developed out there and there are different optimal analyses according to the diversity of the structure. It’s better if you try many. One thing you gotta beware is that cells are aligned simply by the similarity of expression pattern, and this can’t be necessarily equated to actual development or differentiation steps. For example, when two different cell types turn into a strikingly similar cell type, it’s hard to discern which is the starting point and which is at the end point. From this problem, analytical method called velocyto that predicts where a cell is headed in terms of directionality based on a model that models ratio between introns and exons. The model is based on the premise that newly transcribed genes probably have higher ratio of intron sequences. This adheres to the common sense because, of course, freshly synthesized transcripts probably didn’t have time to get spliced and therefore more introns are present in the transcriptome. (Introns would probably be excised and degraded by cellular machinery. Refer to this article called Lariat intronic RNAs in the cytoplasm of vertebrate cells. It says that lariat RNAs are destroyed within minutes of formation in the cell nucleus.) However, many of intronic RNAs are exported to the cytoplasm where they remain as stable circular molecules and are thought to play unexpected roles in cell metabolism.
Usually implemented in R or Python. Seurat in R and Scanpy in Python are widely used. These packages are convenient because they provide the chain of tools for converting, storing, and analyzing these data. There are softwares that are specifically for pseudo-time analysis or canceling experimental variations. So it’s advisable to study them and put to appropriate applications.
Sequencing RNA and other markers (other macromolecules)
Single-cell Genomics
Merging with Imaging: Information about the cell’s location within the tissue disappears. That’s one main disadvantage of single cell sequencing. There are imaging techniques being developed to complement this shortcoming. Usually number of genes that can be viewed simultaneously decreases as resolution increases. If we find efficient marker combination, it might just be possible to merge transcriptome data and imaging data of the entire tissue together to see the expression information at single cell resolution. If we think of biological processes as comprising of three axes: time, space, and gene expression - then retrieving space and gene expression data without loss may become a reality.
We need: - devtools - ggplot2 - cowplot2 - dplyr: Data processing package, C++ version of plyr. Fast computing speed. Functions that pertain to data.frame, data.table, support for databases like MySQL, PostgreSQL, SQLite, BigQuery. Functions like filter() that conditionally extracts rows, and arrange(). and more are super useful. - Rmisc: Ryan Miscellaneous contains functions useful for data analysis and utility operations. https://cran.r-project.org/web/packages/Rmisc/index.html - monocle: Analysis toolkit for single-cell RNA-seq, can performn pseudotimne analysis of single-cell trajectories, clustering by gene expressin pattern, find differential expression (DEG), etc. http://cole-trapnell-lab.github.io/monocle-release/ - rhdf5: Interface between R and HDF5 (hierarchical data format version 5), which is a file format for storing large size data. - cellrangerRkit-2.0.0: demultiplexing, barcode counting, etc. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
We don’t have: - Rmisc - monocle - rhdf5 - cellrangerRkit-2.0.0
library(ggplot2)
library(cowplot) # Helps with creating publication-quality figures on top of ggplot2
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(monocle)
## Loading required package: Matrix
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:Matrix':
##
## which
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: DDRTree
## Loading required package: irlba
library(cellrangerRkit)
## Loading required package: RColorBrewer
## Loading required package: bit64
## Loading required package: bit
##
## Attaching package: 'bit'
## The following object is masked from 'package:base':
##
## xor
## Attaching package bit64
## package:bit64 (c) 2011-2017 Jens Oehlschlaegel
## creators: integer64 runif64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] format print str
## values: is.na is.nan is.finite is.infinite
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## WARNING don't use as subscripts
## WARNING semantics differ from integer
## for more help type ?bit64
##
## Attaching package: 'bit64'
## The following object is masked from 'package:Biobase':
##
## cache
## The following objects are masked from 'package:BiocGenerics':
##
## %in%, match, order, rank
## The following objects are masked from 'package:base':
##
## %in%, :, is.double, match, order, rank
## Loading required package: Rmisc
## Loading required package: lattice
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(HSMMSingleCell) # Single-cell RNA-Seq for differentiating human skeletal muscle myoblasts
For future reference, this is a list of single-cell data packages provided through Bioconductor: link.
my_dir <- "./scRNA"
gbm <- load_cellranger_matrix(my_dir)
## Searching for genomes in: ./scRNA/outs/filtered_gene_bc_matrices
## Using GRCh38 in folder: ./scRNA/outs/filtered_gene_bc_matrices/GRCh38
## Loaded matrix information
## Loaded gene information
## Loaded barcode information
## Could not find summary csv:
## ./scRNA/outs/metrics_summary.csv.
## This file is only necessary if you are performing depth-normalization (calling the equalize_gbms function) in R.
## If this pipestance was produced by `cellranger aggr` with the default parameters, depth-normalization in R (via equalize_gbms) is not necessary.
# check the object gbm's class
class(gbm)
## [1] "GeneBCMatrix"
## attr(,"package")
## [1] "cellrangerRkit"
attr(gbm, "package")
## NULL
dim(exprs(gbm))
## [1] 33694 8381
exprs: These generic functions access the expression and error measurements of assay data stored in an object derived from the eSet-class. eSet is a class of container to contain high-throughput assays and experimental metadata. These contain one or more identical-sized matrices as assayData elements. Derived classes (e.g., ExpressionSet-class, SnpSet-class) specifiy which elements must be present in the assayData slot.
eSet object cannot be instantiated directly. It’s a virtual class, so instances cannot be created. Objects created under previous definitions of eSet-class can be coerced to the current classes derived from eSet using updateOldESet.
eSet object contains slots of assayData, phenoData, featureData, experimentData, annotation, protocolData, .__classVersion__ and methods of sampleNames, featureNames, dimnames, dims, phenoData, pData, varMetadata, varLabels, featureData, fData, assayData, experimentData, etc.
Aside all this, there’s this awesome site that explains single-cell genomics neatly. It’s a blog run by Rutgers Stem Cell Research Center led by Professor Hart. It says the 10X Chromium system has become the gold standard for single-cell sequencing and therefore 10X Genomics’ Cell Ranger software is
exprs(gbm): Retrieve Expressionexprs(gbm)[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
## AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1
## ENSG00000243485 . . .
## ENSG00000237613 . . .
## ENSG00000186092 . . .
## ENSG00000238009 . . .
## ENSG00000239945 . . .
## AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1
## ENSG00000243485 . .
## ENSG00000237613 . .
## ENSG00000186092 . .
## ENSG00000238009 . .
## ENSG00000239945 . .
pData(gbm): Retrieve experimental phenotypes (phenoData)dim(pData(gbm))
## [1] 8381 1
head(pData(gbm))
## barcode
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1
fData(gbm): Retrieve Feature Datahead(fData(gbm))
## id symbol
## ENSG00000243485 ENSG00000243485 RP11-34P13.3
## ENSG00000237613 ENSG00000237613 FAM138A
## ENSG00000186092 ENSG00000186092 OR4F5
## ENSG00000238009 ENSG00000238009 RP11-34P13.7
## ENSG00000239945 ENSG00000239945 RP11-34P13.8
## ENSG00000239906 ENSG00000239906 RP11-34P13.14
The main object class in Monocle is the CellDataSet object and it’s what it requires its data to be housed in. CellDataSet extends Bioconductor’s ExpressionSet class and the same basic interface is supported. to get started we need to create a CellDataSet object with the newCellDataSet() function. The CellDataSet object was derived from the ExpressionSet class, so it’s easy to create, since the gbm object was also derived from the same class. However, Monocle expects that the gene symbol column in the feature data is called gene_short_name or else you will get a warning. (Make sure that gene symbol column in the feature data is called gene_short_name.)
my_feat <- fData(gbm)
head(my_feat, 5)
## id symbol
## ENSG00000243485 ENSG00000243485 RP11-34P13.3
## ENSG00000237613 ENSG00000237613 FAM138A
## ENSG00000186092 ENSG00000186092 OR4F5
## ENSG00000238009 ENSG00000238009 RP11-34P13.7
## ENSG00000239945 ENSG00000239945 RP11-34P13.8
names(my_feat) <- c('id', 'gene_short_name') # Gene Symbol column's name must be changed to gene_short_name
# in order to avoid warnings.
CellDataSet object of R Monocle package with newCellDataSet()my_cds <- newCellDataSet(exprs(gbm),
phenoData = new("AnnotatedDataFrame", data=pData(gbm)),
featureData = new("AnnotatedDataFrame", data=my_feat),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
negbinomial.size() = Negative Binomial Distribution Family Function With Known Size, Maximum likelihood estimation of the mean parameter of a negative binomial distribution with known size parameter.
my_cds
## CellDataSet (storageMode: environment)
## assayData: 33694 features, 8381 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 ...
## TTTGTCATCTGCTTGC-1 (8381 total)
## varLabels: barcode Size_Factor
## varMetadata: labelDescription
## featureData
## featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000268674
## (33694 total)
## fvarLabels: id gene_short_name
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
my_cds <- estimateSizeFactors(my_cds)
my_cds <- estimateDispersions(my_cds)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Removing 146 outliers
`group_by_()` is deprecated as of dplyr 0.7.0.
Please use `group_by()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.`select_()` is deprecated as of dplyr 0.7.0.
Please use `select()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
Doesn’t work with deprecated dplyr. So I downgraded it.
Doesn’t work yet again, and the program complains about rlang version being too high. Let’s install rlang 0.3.0.
Whatever. It works well on Windows.
detectGenes(cell_data_set, min_expr = 0.1) funcion outputs all cells that have genes expressed over 0.1 and also outputs all genes expressed in all cells.
my_cds <- detectGenes(my_cds, min_expr = 0.1) # Gene that is expressed over 0.1
head(fData(my_cds))
## id gene_short_name num_cells_expressed
## ENSG00000243485 ENSG00000243485 RP11-34P13.3 0
## ENSG00000237613 ENSG00000237613 FAM138A 0
## ENSG00000186092 ENSG00000186092 OR4F5 0
## ENSG00000238009 ENSG00000238009 RP11-34P13.7 24
## ENSG00000239945 ENSG00000239945 RP11-34P13.8 1
## ENSG00000239906 ENSG00000239906 RP11-34P13.14 0
summary(fData(my_cds)$num_cells_expressed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 5.0 349.9 234.0 8381.0
sum(exprs(my_cds['ENSG00000239945',]))
## [1] 1
sum(exprs(my_cds['ENSG00000238009',]))
## [1] 24
x <- pData(my_cds)$num_genes_expressed
Standardization
x_1 <- (x-mean(x)) / sd(x)
summary(x_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8839 -0.6155 -0.2282 0.0000 0.2946 7.0675
df <- data.frame(x=x_1)
ggplot(df, aes(x)) + geom_histogram(bins=50) + geom_vline(xintercept = c(-2,2), linetype = "dotted", color="red")
Here’s another example with PBMCs (peripheral blood mononuclear cells) from a healthy donor.: https://davetang.org/muse/2017/10/01/getting-started-monocle/. The book’s contents are probably derived from this post.
So there are only a few number of cells that have significantly expressed genes. For example, Cells that are 5 sd-distances away from the average are discovered and labeled as expressed genes.
Many mRNAs are transcribed from one gene. Unique molecular identifier (UMI) can amplify and sequence each mRNAs and align them to find mRNA variants. Alright, so you have index sequence that can uniquely hybridize to each transcripts attached to 10 base long degenerate molecular tag (random) as a probe (probe = index sequence that can identify 5’ poly A tail + 10 base long degenerate molecular tag). This hybridizes to the target mRNA and individually amplify each RNA and individually sequence. This way we can consistently retrieve mRNA sequences from the same gene that are sequenced independently. And with alignment of these indenpendently sequenced mRNAs, we can discover variants in mRNAs that are expressed at low levels. However, there’s a drawback with short transcripts of length below 500 being over-sequenced and over-represented. Let’s add meta-data like UMI into phenoData and print it out.
# Add a UMI column into phenoData
# Use Matrix because data is in the sparse format
pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds))
head(pData(my_cds)$UMI)
## [1] 2394 1694 4520 2788 4667 4440
What does using matrix because the data is in sparse format?
So when I looked it up on Wikipedia, sparse format is a matrix where most of the elements are zero.
my_cds$num_genes_expressed
## [1] 871 806 1316 898 1526 1495 1253 1433 1632 1134 1176 1900 1548 1367
## [15] 1390 1414 1324 1174 1607 1768 1132 2214 958 1333 630 1035 1151 1001
## [29] 1478 1714 947 1155 979 1408 2686 1083 1166 2456 663 1387 1045 1927
## [43] 530 1518 1202 1249 1481 2112 997 1576 1706 967 968 1306 1701 1023
## [57] 1171 1138 1060 1134 1477 1390 937 1136 1078 1826 2819 1172 1326 1359
## [71] 1647 826 1080 1606 1090 741 1232 1010 1323 675 842 3171 1261 1354
## [85] 1899 1581 1958 1175 1394 1304 1415 1691 729 1225 2256 1219 1741 1492
## [99] 3062 2015 1458 1546 1602 1013 1448 1416 992 1347 1463 1153 1186 1819
## [113] 1174 1654 1294 1182 1106 1007 692 1784 1132 1353 1042 1010 1592 1230
## [127] 1150 1382 1304 1965 734 1627 1595 2770 1593 1170 1476 1244 1620 1439
## [141] 1281 1132 1301 823 1323 1250 1286 1646 1025 2201 2267 1519 1666 1357
## [155] 1110 2330 940 1192 1178 1051 1177 1097 1730 1290 1140 1401 1146 905
## [169] 1516 1230 1610 934 2512 2633 1612 1288 1322 880 1229 1337 1587 1081
## [183] 1227 1339 1192 1360 1338 2090 1223 727 775 1922 854 1339 1627 1104
## [197] 1963 949 2231 1984 1944 1153 1549 1174 1878 1075 1486 1436 1660 2229
## [211] 1681 1382 1023 1537 1262 1205 1040 1322 774 1820 1075 1028 1433 1214
## [225] 1884 2280 1296 2020 1203 3544 1522 1051 1557 984 2065 2235 969 1656
## [239] 1013 1483 1292 1319 1273 1094 964 1681 1024 1111 1252 1040 1226 1517
## [253] 1466 2324 1491 1193 1606 780 1262 1151 1176 1105 925 1779 1982 1358
## [267] 675 916 1031 919 1504 1690 1308 666 1269 1404 1472 739 2230 1422
## [281] 1833 1549 1130 1420 1276 2077 2036 1649 1260 938 2553 1607 1311 1195
## [295] 1134 1327 1196 1147 1342 1359 859 1394 645 1890 2683 1374 2783 1339
## [309] 1660 1761 1235 1877 1188 1735 1160 794 635 1010 1082 1485 1702 1389
## [323] 2304 1419 1917 1455 877 1316 1687 1953 2345 3118 1373 1221 952 1371
## [337] 1150 1094 1318 1526 1384 1331 1198 2770 4062 2578 1818 1625 714 1349
## [351] 2701 2163 1659 812 1020 1975 769 1744 1224 1009 1419 965 1234 1476
## [365] 1186 733 2060 715 2717 1629 1413 815 769 657 1195 2756 1376 1009
## [379] 760 1153 3278 1243 971 1260 1169 931 754 1275 1131 1484 793 1236
## [393] 1426 978 1094 2172 1045 908 1103 1424 1134 1793 1470 1418 1240 1213
## [407] 1481 958 1288 1417 1550 1067 1561 954 2882 1337 1474 1680 1640 734
## [421] 1656 2891 1210 1364 1459 1625 1378 1538 811 1336 1366 1394 1295 1425
## [435] 1443 1241 1126 1283 1342 934 2092 2004 1025 2615 1294 1568 1089 1554
## [449] 1665 1576 1179 1175 1100 1159 1032 1177 3425 1350 1650 1119 1048 1653
## [463] 1585 1734 1290 1259 1661 1096 1139 1489 1075 864 1083 1190 1196 1071
## [477] 1163 1082 1270 1007 1188 1219 1184 1122 1010 1043 1190 1637 1164 1429
## [491] 1316 1438 1509 1350 1435 2061 1439 1050 2453 1119 1029 1379 1319 1257
## [505] 1563 2383 1339 1928 1370 1674 2557 789 1389 1566 1143 923 1087 1400
## [519] 1213 1218 1116 1104 1684 942 1258 3444 1314 1775 1176 2023 1423 1459
## [533] 1289 1435 1057 928 1289 1193 2068 921 2071 1311 1449 1186 1328 1901
## [547] 1432 1342 2016 1595 1631 1221 1231 1580 2082 1319 1261 922 1361 1517
## [561] 1107 1338 1349 1008 1575 966 4166 1512 1120 2521 1207 1176 1251 1684
## [575] 999 1337 1289 1386 1401 1656 1188 984 1269 1624 1389 799 1566 2241
## [589] 1084 1396 1167 1282 1550 1480 1482 1029 2651 1377 1310 1246 1590 1278
## [603] 1612 970 1353 1253 1259 1048 841 1827 962 1206 1248 1156 1319 1152
## [617] 1439 1136 1388 635 1231 1107 1230 1491 1094 956 1343 1571 1401 1327
## [631] 1098 1518 1532 1369 1892 1398 1488 1423 1370 1763 1384 1040 2333 1375
## [645] 1364 3596 990 1460 1182 1292 1127 991 1400 1712 1114 1221 1969 1569
## [659] 1201 1438 1030 1243 1573 1186 1107 1054 1022 1033 1557 1062 1116 1149
## [673] 1359 1329 2610 1195 2128 1218 1348 1437 1426 2138 1631 1016 1292 1267
## [687] 1060 1259 659 1835 1696 1605 1162 3199 2137 1482 1511 1069 773 1992
## [701] 1163 4148 1652 1472 1695 1228 1141 2125 1215 1481 2968 656 1585 959
## [715] 3046 1549 1209 1448 1399 1879 1444 709 1867 1187 1142 1239 2217 2935
## [729] 1331 844 1356 1142 1739 915 1127 1276 949 990 1512 1168 1287 1102
## [743] 1502 2519 2216 1503 1442 994 1057 1531 1499 2108 2202 840 1480 1361
## [757] 1167 1149 1002 1159 784 2007 997 2042 1203 2655 1051 1050 2019 1475
## [771] 1123 2017 1286 1963 731 1556 1422 1187 1157 1435 925 1301 1003 1192
## [785] 1140 1325 1638 3213 2517 1506 1169 1666 2052 1766 1676 916 1296 1278
## [799] 1327 3360 1608 1244 831 1035 1124 1477 1005 1530 1060 2139 1221 1742
## [813] 1110 1372 1265 1450 1628 1156 811 2845 1312 946 1136 1252 1329 1284
## [827] 963 1051 1540 1223 1457 707 1097 1184 918 1299 1813 1590 1245 1106
## [841] 2810 1328 1464 3138 1799 2486 1150 1500 1072 3141 1106 1220 1531 1821
## [855] 955 1217 1332 3029 1335 2587 1046 1962 2538 1607 778 955 1292 2240
## [869] 751 950 1427 1365 1102 2426 1902 1316 1709 1130 3038 1540 1584 1234
## [883] 3039 3222 1168 1123 1240 914 1360 1743 1118 1376 1157 681 1323 1044
## [897] 1302 1084 1757 1847 992 1149 689 1529 1256 1475 1561 1668 891 1933
## [911] 2828 1137 2056 1310 2679 1447 1534 1297 1202 1875 1256 937 1450 1289
## [925] 1243 1913 1260 1180 1129 1213 2298 2248 1546 1056 1635 1296 1497 1398
## [939] 1498 1270 1458 2268 2132 954 1784 1821 1459 1416 1050 1238 1904 1117
## [953] 1499 2431 1326 1700 1845 1432 1310 1429 1434 1241 841 1663 1496 1263
## [967] 1348 1110 965 1337 973 1227 1099 3068 1017 995 1796 1017 2541 1433
## [981] 1138 1443 2309 1264 1235 1224 1057 1473 1106 2007 1721 930 933 1172
## [995] 1093 2736 1650 926 1060 1409 1746 1263 2004 1487 988 1053 1097 1480
## [1009] 1845 864 1559 1002 1274 1922 910 1642 1334 801 2642 1505 1285 1448
## [1023] 1415 959 2843 1452 1140 972 1213 3065 894 1245 1533 1285 1193 2393
## [1037] 1089 1208 1854 1349 1094 940 1269 1179 1285 1718 1432 1363 868 1712
## [1051] 1217 920 1466 1076 1146 1680 1677 4047 1106 1005 1076 2847 1419 1091
## [1065] 2162 1112 1412 1075 1346 916 936 846 2870 2916 1088 1259 1366 1210
## [1079] 1013 2913 1363 1520 1948 961 1594 1288 1785 1398 1130 3180 1253 1562
## [1093] 1257 1363 2168 2279 1107 885 1412 2194 2766 1279 1052 1155 1425 1436
## [1107] 1902 901 1155 1580 1329 839 3105 1240 1153 1264 3645 1450 1353 3501
## [1121] 1336 1727 1032 1459 4075 1502 944 914 1115 1935 1444 1936 1037 1386
## [1135] 1314 2221 1314 1141 1601 1354 1086 1222 1494 1194 1078 1527 1105 1151
## [1149] 1212 976 2080 1042 926 1376 1525 823 1579 1221 1303 1179 2147 1489
## [1163] 1221 2115 949 1452 1473 1178 1454 1420 1166 1493 821 1128 1549 1017
## [1177] 2251 850 1612 2944 2315 2269 1006 1096 1128 1312 1403 1315 892 954
## [1191] 1059 1056 1287 1900 1409 1196 1630 1074 1633 1441 2244 1804 1068 1507
## [1205] 1633 998 854 926 1272 833 974 1303 1061 1068 1744 1331 1921 1174
## [1219] 1175 1990 2485 1109 1101 1505 1823 1439 1904 1378 1428 1336 880 1622
## [1233] 2093 1293 2218 1352 1093 1455 960 1079 1418 1243 1142 875 1731 2356
## [1247] 804 1540 1004 891 1019 923 1141 1035 1285 823 1110 1159 761 1517
## [1261] 1233 1328 1240 866 1190 1055 1377 1124 1164 2451 1079 895 1756 999
## [1275] 1712 1457 1556 1391 1455 1455 1028 1421 1332 1159 1086 1632 1535 1052
## [1289] 2086 1001 1512 1529 1651 1297 1571 1536 1629 1443 1333 1043 1320 1194
## [1303] 1867 1194 1918 1276 2073 770 1282 1218 1511 1213 1385 1573 1295 2522
## [1317] 1914 1977 945 1776 1270 2349 2206 1474 1639 1216 1101 1259 917 1050
## [1331] 737 1808 863 1198 2475 1497 1174 898 1169 1263 1512 855 1263 1111
## [1345] 1264 1008 1168 1428 877 1286 1426 1066 1361 2182 1298 1467 959 1357
## [1359] 808 881 1436 1502 1391 4281 1923 997 1334 1212 1457 998 1828 1693
## [1373] 952 1066 2173 1172 1154 689 1829 1694 2251 1478 1138 1505 1058 1224
## [1387] 1292 1160 1298 1057 1337 937 1218 1141 1448 1326 1448 2335 946 2317
## [1401] 790 2693 2114 1343 1805 955 2057 1394 1865 1196 1147 1587 1347 2389
## [1415] 1028 1288 2066 1645 776 1096 4356 2221 1270 1618 1526 1256 1123 1324
## [1429] 1427 1279 1390 1627 1388 908 1271 1256 1514 1197 1237 764 1097 750
## [1443] 1712 1002 1005 1225 893 1099 1547 1183 1105 1138 1654 1138 1022 1682
## [1457] 1301 2154 1402 1323 1798 1071 1853 1098 925 1767 1365 942 1735 1067
## [1471] 1689 1892 808 1204 1309 1312 1247 2822 943 1595 1060 2194 1124 2013
## [1485] 1326 1455 735 902 2283 923 1590 1558 1496 1371 1146 1348 1692 1307
## [1499] 1452 1133 1049 1243 1387 886 961 1010 925 1322 1749 1283 1344 1038
## [1513] 2333 1037 921 1195 2234 1506 889 1801 1143 870 1401 1444 1246 1694
## [1527] 1333 1133 2267 1415 1237 1642 1284 1461 1212 1167 1067 1302 1346 1155
## [1541] 623 1171 1296 1091 1325 1167 1295 1609 722 1754 1038 1494 1638 1259
## [1555] 1111 1441 862 910 1069 1394 692 647 1230 1442 1294 2008 1662 1332
## [1569] 1588 1426 1387 1064 1035 1485 1352 1388 1172 1063 1585 1501 1211 2058
## [1583] 1494 918 966 1567 1164 985 2336 929 980 1291 1563 1243 755 1104
## [1597] 1379 1792 969 2761 1034 1231 1861 1568 1353 1495 1500 790 1820 1552
## [1611] 1181 941 1437 1741 2533 1793 1358 3089 2191 528 1073 1128 1060 1632
## [1625] 1506 1519 1401 1543 1211 911 1298 1497 1110 1417 1596 1103 1635 902
## [1639] 1303 1329 2325 1990 844 2172 1370 1908 1406 1734 1494 1358 1239 2105
## [1653] 759 1404 1226 1047 1250 1026 3182 2576 1179 1050 1280 1430 1946 1357
## [1667] 1535 981 1060 759 1424 1558 1370 2022 993 1078 1158 1463 1148 1263
## [1681] 1255 1372 1803 1321 1132 2021 1232 3640 1409 2492 2790 1292 1532 1178
## [1695] 1266 1296 950 1402 1715 1610 835 1105 1393 1134 1595 1005 1226 1157
## [1709] 1818 1465 1442 1260 1460 2279 1466 1521 1399 948 1360 735 1268 1212
## [1723] 1787 1551 2205 1142 1379 872 1457 2433 1087 1361 1403 2187 995 1044
## [1737] 1903 1263 1528 1623 895 1009 1717 1065 1239 857 999 1433 2100 1530
## [1751] 893 1356 1813 2613 1073 2086 1364 1788 1526 1349 1914 897 1370 1817
## [1765] 858 1264 1632 1124 2083 1369 1398 1169 1044 1081 962 1350 1419 1913
## [1779] 990 1175 2259 1057 1827 2289 1854 1401 1136 1279 1584 1651 1141 1331
## [1793] 1385 1499 2248 1093 1490 1915 1022 1800 1113 866 1155 1146 1317 1429
## [1807] 1332 2014 903 1442 1163 951 1067 1265 2358 1317 1217 1448 1405 2036
## [1821] 2775 1434 893 1034 1102 1381 1147 1744 1405 1081 1433 1129 1131 1575
## [1835] 1524 1072 803 1407 1417 2076 1747 649 959 1162 2699 799 1578 1539
## [1849] 1495 1073 689 2751 3464 3586 1121 1490 1100 1035 1358 1376 1194 1070
## [1863] 1041 1217 1584 1138 1137 946 1200 1392 1385 2427 829 1158 1185 1187
## [1877] 1142 2540 1044 873 1287 1255 1495 956 1099 1373 1436 2820 1293 1601
## [1891] 1264 1240 1355 1367 1192 1621 1764 1400 1057 1488 1796 1445 1340 2060
## [1905] 1053 1182 1125 1325 1546 1093 1856 720 1205 1176 1755 1400 1606 1822
## [1919] 1887 1490 1721 1485 1784 1633 1324 1181 970 1368 1279 502 1563 862
## [1933] 1095 1686 1396 1880 807 1309 1199 802 1011 1153 1494 2349 1336 1168
## [1947] 1379 1206 654 1961 716 1379 2231 1638 1248 1333 1491 1233 1007 2674
## [1961] 2015 2263 1278 1695 1172 1292 1258 1389 1062 1483 1598 1142 816 1428
## [1975] 1244 2965 1045 1313 1357 1770 1314 1225 1261 1481 925 955 1264 1173
## [1989] 947 1351 1689 1081 1120 999 798 1764 1636 1364 2137 1079 1624 1322
## [2003] 965 1106 1696 1249 1171 1169 2433 1003 1127 1651 1869 1410 1441 1419
## [2017] 1979 1148 1145 1384 1271 1333 2613 2053 2262 957 2346 1250 1579 1077
## [2031] 921 1049 1159 1173 1653 1055 1124 2184 769 3288 1216 1507 2858 1772
## [2045] 1322 2783 2197 3260 1365 680 1521 1862 1199 1725 1459 1642 1367 1647
## [2059] 1185 991 1428 1632 1421 1244 1531 1362 1380 1098 1186 1189 1550 1886
## [2073] 1312 835 892 1270 1475 1239 1101 900 743 1595 1007 1363 1417 1447
## [2087] 1403 788 1146 1278 1177 1662 1673 2550 881 1343 1127 1618 1284 1174
## [2101] 930 941 1278 875 1369 653 1892 1258 1289 2287 1534 1365 1284 1355
## [2115] 1599 1680 1479 1223 1066 1442 1123 1576 1480 1426 2259 1218 1036 1120
## [2129] 1360 1144 1344 1328 1101 1159 1600 643 1073 1471 1241 1814 1156 1230
## [2143] 1415 1743 1489 1604 1392 1045 1284 1063 1359 1524 936 1248 1071 1382
## [2157] 2057 2196 1089 724 1494 1904 1610 1472 1007 1287 1652 1793 1437 1230
## [2171] 1565 1702 1170 1747 968 1065 1297 1259 1562 996 3213 1087 1657 1444
## [2185] 1305 1345 1260 1765 1915 1253 1366 1505 1099 1067 1574 1230 1354 1614
## [2199] 1341 1550 1630 1186 1122 1240 2588 1295 1004 1454 856 1130 728 1083
## [2213] 1157 1228 1615 1003 1193 1154 2164 1494 1080 1156 1712 1197 1432 1137
## [2227] 1476 1306 919 1393 763 1526 2309 2277 2567 1187 1158 1078 1041 661
## [2241] 1507 2258 2658 1085 715 1504 969 1452 1355 1586 1371 1600 1057 1368
## [2255] 1436 3074 1256 3033 1075 970 2407 1486 2204 2446 1926 1434 1010 1305
## [2269] 1859 1149 1215 937 1218 2561 1385 953 1054 1356 1424 1100 1103 1445
## [2283] 1215 990 1254 1355 1681 944 895 1123 3089 1288 780 1512 1731 1295
## [2297] 1678 757 2700 1011 818 1418 1031 1293 944 1154 1035 758 1224 1376
## [2311] 1115 1480 918 1395 840 1283 972 2122 1040 1990 1077 912 1430 2051
## [2325] 1049 1287 1032 1126 1261 1241 1208 2071 1997 1152 1366 1225 1769 1157
## [2339] 1559 1048 2329 2720 1078 970 1365 1062 1325 1232 1755 1303 1288 1094
## [2353] 2941 1039 892 2351 1663 1497 1300 863 727 909 1447 3123 1678 1084
## [2367] 1079 1208 1605 1096 1813 1006 1007 1170 1179 1035 1393 1542 682 1345
## [2381] 1022 656 986 1030 1358 1393 1821 1318 1389 4105 1571 712 1121 2222
## [2395] 1434 1610 980 2005 1265 1425 2021 1006 1425 1290 1823 892 879 1113
## [2409] 1780 1234 1275 917 1138 1452 1608 1445 1334 3716 2041 1150 1124 1114
## [2423] 1184 1315 1074 632 1607 895 2214 1271 1297 1448 1336 1375 2249 1188
## [2437] 2549 1434 1117 1017 1475 1492 2098 1793 1252 1354 1049 1052 1357 1162
## [2451] 1048 1040 739 1229 1208 1160 1312 1090 1395 1138 1292 1204 1374 1027
## [2465] 1287 1357 1430 1328 1016 2755 1326 865 1350 1337 1697 1596 814 1360
## [2479] 1402 1725 1475 2201 1399 2464 1094 1174 1421 2169 1456 922 1364 1076
## [2493] 901 1671 1010 1376 1503 1505 1271 1341 2294 1530 1718 1359 862 1547
## [2507] 1296 1517 851 1245 1068 1253 1546 1212 1302 1083 1095 1307 1610 1361
## [2521] 1748 1001 1659 1224 2087 966 1790 1293 1458 1212 1326 1204 1020 1559
## [2535] 1249 1209 1335 3312 2091 1649 1287 1176 1289 1320 715 1711 1182 1170
## [2549] 1033 1210 1429 1517 955 892 1260 1499 1684 1359 1129 1460 1573 1182
## [2563] 1121 983 1692 2193 1631 1105 1360 1198 1406 1713 1316 729 1852 1161
## [2577] 1296 934 1189 1331 1151 970 1261 1374 1306 693 1292 1173 1271 931
## [2591] 1557 993 2150 1196 761 712 1604 1677 1795 2695 1140 1253 1752 1909
## [2605] 1564 1836 890 957 1155 1272 1226 1478 1121 931 1037 1059 1299 1383
## [2619] 1175 1195 1460 1781 1529 2702 1195 1647 2247 659 1159 1081 1731 1063
## [2633] 1723 1237 1020 1828 1388 1785 681 2923 1874 1027 3112 1239 1467 1241
## [2647] 1777 1241 1178 664 1059 1419 1032 1653 1409 1324 935 1269 1950 1621
## [2661] 1223 1569 2069 1523 1343 1200 1119 1130 1462 1289 1150 1234 1052 1197
## [2675] 1153 1149 1545 1107 1867 788 1238 1294 1363 1245 1146 1198 1319 1327
## [2689] 773 1502 1359 1155 1261 1177 932 1400 1029 894 1497 876 989 1104
## [2703] 1414 1177 1986 1444 1425 1184 1370 995 1742 1129 1027 1361 1372 884
## [2717] 1559 1480 1013 1380 1065 1108 1148 1077 1589 1161 1144 1336 1166 1033
## [2731] 1190 1428 1035 1159 2804 1037 1159 1307 2441 1947 2928 1204 3324 1087
## [2745] 1492 1958 1441 2128 1485 957 1465 1427 929 1156 905 1278 2206 1040
## [2759] 1114 1506 1408 1147 1257 948 1395 967 1272 1236 2074 1277 1290 1219
## [2773] 1744 1780 1206 2574 1215 1300 1398 2394 1119 1576 1236 1342 1729 1474
## [2787] 960 1171 1395 1113 1108 1165 901 867 1082 951 2131 790 1222 1408
## [2801] 1736 1080 949 1509 1413 1379 945 1274 1953 1211 1428 1179 1310 1165
## [2815] 898 2035 2059 1522 1077 834 1580 643 1610 1026 2030 1065 1346 969
## [2829] 2165 1757 1193 1297 2434 1380 1453 1419 1276 1612 1660 900 1050 1061
## [2843] 1340 1376 1173 1571 1722 1812 1135 1071 1374 1426 1293 1553 867 2331
## [2857] 1263 1248 1137 1546 1364 3131 960 1093 1338 1425 1018 1455 1245 1252
## [2871] 1165 1654 1200 1215 1055 1686 1446 965 1134 1219 1015 1516 1454 1108
## [2885] 676 1253 1219 1924 3211 1110 1154 1505 1671 1899 1461 1386 936 1662
## [2899] 1272 2184 1167 1390 1529 1525 1167 1390 1776 2630 1762 1394 2410 1373
## [2913] 1150 1173 1333 1432 1442 1195 1149 1283 1094 753 800 1037 1295 1195
## [2927] 1123 1065 1363 1369 1333 1256 958 1482 1008 1274 1404 1430 1271 1748
## [2941] 2209 1313 1106 1514 2358 2274 1651 1451 930 1284 1948 1281 1483 823
## [2955] 2301 1175 1390 1287 1817 884 1368 1285 1732 2314 1759 1486 1116 1212
## [2969] 1944 2463 1588 1909 1132 1653 1208 1295 1513 810 1185 1155 1246 768
## [2983] 1258 936 1380 1128 1296 1330 1331 1693 1591 1530 3107 1015 1201 1341
## [2997] 683 1486 1005 2043 1040 1204 1281 1537 1168 1712 1098 1746 1517 1399
## [3011] 1797 1177 1377 2530 1590 1362 1232 1445 954 1820 1069 1208 1030 1126
## [3025] 1538 2384 1443 1242 1556 1672 1452 3076 1861 1242 4640 1548 1471 1294
## [3039] 1527 1249 1354 1590 1335 1158 1227 1363 1034 1193 1484 2101 854 1788
## [3053] 1413 1198 1235 714 2177 1372 917 1419 1909 1986 1203 2828 833 915
## [3067] 2257 1241 2485 1124 1433 1074 1261 1166 1038 1071 817 1174 2232 1259
## [3081] 1213 1312 2249 2882 762 926 1106 1797 1003 1507 1048 1873 1189 916
## [3095] 1125 1065 1325 1030 853 1303 2044 1556 1166 1339 949 1164 1335 1237
## [3109] 2829 1561 2315 907 1363 1405 1885 1459 3149 1444 785 827 1259 1065
## [3123] 1656 1274 1242 1123 1579 1058 1319 2427 2065 836 1322 1625 1049 1430
## [3137] 1206 1163 1596 1114 886 1355 1405 1044 1124 1174 1593 1399 1657 1426
## [3151] 735 1261 1107 1805 1353 820 1514 1572 1021 1487 1478 1123 1046 1452
## [3165] 1235 2196 964 1455 1085 867 1154 1169 1398 1258 1531 925 1141 885
## [3179] 942 1256 1819 980 1638 1050 945 1297 1271 1219 1417 1529 1295 1180
## [3193] 1237 1357 1211 1311 1563 745 1067 1931 934 879 866 971 1265 766
## [3207] 3088 935 1160 1108 641 1152 1131 1249 836 957 1062 1186 1214 1310
## [3221] 868 1096 1305 1147 1590 1055 1483 1282 1374 2300 1206 1464 1836 1378
## [3235] 1290 1046 774 986 1027 1470 1639 915 1570 1127 1142 978 926 1022
## [3249] 1152 1709 1263 1247 1811 998 1505 1463 1820 1415 1126 1384 977 1316
## [3263] 1408 1599 1594 1223 1371 1435 1002 1215 1488 2159 1316 1972 767 1876
## [3277] 2456 1145 1271 1512 1581 1024 1280 1400 1159 1825 1139 1082 1001 2050
## [3291] 943 1875 2063 1588 974 1306 1169 1803 1192 1654 1398 954 1603 1596
## [3305] 1397 1328 1243 1272 1196 1566 1036 1385 2065 782 1242 1404 1723 870
## [3319] 1431 1232 1159 1045 1243 1450 924 982 1185 1185 1102 1471 1405 1969
## [3333] 2073 1249 1517 1207 725 890 1348 1130 1455 1009 1800 1792 1323 1475
## [3347] 1207 1128 1028 1304 909 1735 1515 1520 1191 2732 1666 1602 1453 1464
## [3361] 939 1192 1170 1367 1572 1249 1172 1465 1294 1071 1053 1452 3282 1074
## [3375] 1592 1288 1358 2835 1200 1210 1815 1213 1063 1089 1512 1152 1344 741
## [3389] 1598 1539 1435 2672 1140 1219 1192 1643 793 1271 1094 1086 1120 1296
## [3403] 984 1202 2272 1221 1170 1236 721 977 1170 1970 1157 1293 995 1210
## [3417] 1913 1476 1151 1125 1150 1598 1786 2257 1685 1427 1505 994 677 1345
## [3431] 1295 934 728 1375 1408 1235 3207 1750 1285 1920 1097 1547 1359 2547
## [3445] 1571 2857 1391 1570 1265 1056 1577 1164 1386 1266 1274 1219 1302 1272
## [3459] 1844 1089 1096 1380 1155 1121 1060 1441 1391 1273 1544 1279 1363 1030
## [3473] 1106 979 1662 1190 1536 1736 957 1947 949 1059 1171 3876 1445 1031
## [3487] 1554 2083 1138 1287 1053 1369 1083 716 1510 1553 1427 1327 994 1069
## [3501] 1398 1254 1222 1126 912 1755 767 925 2008 1198 1308 1168 1392 2903
## [3515] 1583 1760 2132 1263 1302 1750 1229 2354 983 1195 1260 952 1199 948
## [3529] 1238 3325 1022 1597 1752 921 1083 2722 1196 1342 1252 966 1790 2312
## [3543] 1282 1306 992 1140 1150 2495 989 2195 1886 1259 1192 1352 1209 1395
## [3557] 1027 1119 2358 1397 976 1387 2090 2039 1505 767 1215 1114 1150 1483
## [3571] 2059 1243 1448 1041 1273 1438 1305 1587 1144 1091 1748 1294 886 539
## [3585] 837 2239 1636 1030 853 1187 1471 1158 1107 2543 1256 1648 1933 1192
## [3599] 1156 965 972 2927 1112 646 1206 785 1110 2679 1606 1099 1007 1201
## [3613] 965 998 1961 1570 818 990 877 1209 1194 1170 1915 1007 1347 1906
## [3627] 1281 1448 1407 2497 2357 1222 1973 1173 1147 1087 1735 1975 1092 1130
## [3641] 1776 1013 2030 841 1079 1336 1195 1053 1567 1385 1109 2538 1093 1428
## [3655] 1184 1288 882 1127 1916 1379 1215 1365 1537 1357 1478 1267 968 1115
## [3669] 1436 1484 1699 1011 1253 792 2160 1303 1473 1310 1456 1133 2130 1254
## [3683] 1112 1136 1168 1445 934 1063 3709 3583 1124 2379 2130 1383 2050 1279
## [3697] 1090 1276 1408 1199 1179 2460 1398 1179 1961 1353 2357 1619 1686 1662
## [3711] 1156 1659 937 1530 1249 1418 1555 2129 2154 1249 2655 1198 2461 1099
## [3725] 2101 1476 1381 1490 1407 1513 1873 1359 1771 1927 1495 1133 1629 2279
## [3739] 1587 1094 1083 1220 2650 971 1120 1621 1365 1828 964 1356 748 1271
## [3753] 1014 1155 1299 1005 1224 1421 1012 1256 1613 1958 1117 1406 1300 1936
## [3767] 1025 1587 1281 1176 838 1971 1332 968 1417 1503 733 1042 1297 661
## [3781] 1060 1579 1032 1063 1530 702 2311 1258 1206 1506 1285 1977 1206 2648
## [3795] 2244 1439 1235 1769 1418 1483 1250 1177 1155 1235 926 758 1274 1530
## [3809] 1214 1579 1164 3547 1472 1246 996 1075 1499 1254 1458 1959 1034 1563
## [3823] 1539 1403 1420 1143 1235 797 1212 1219 1160 1428 990 1529 1697 1218
## [3837] 1778 1133 1499 1345 1313 870 1117 1760 714 819 1204 2213 1722 1349
## [3851] 1443 1371 1912 2356 1042 1350 874 1413 1880 2472 855 746 747 1293
## [3865] 1019 905 1166 1465 1109 776 1465 1065 683 1196 894 1399 1369 715
## [3879] 1055 2525 1479 1280 1165 1335 1162 1245 1395 1729 1211 1464 1054 1297
## [3893] 1867 1837 926 1835 1097 659 1202 1093 2465 1247 2322 1512 1004 1178
## [3907] 1339 1479 811 2272 1855 1069 976 965 1480 1212 1485 1134 2468 1130
## [3921] 915 1501 1083 1490 1322 1667 3112 1116 793 1032 1258 1445 1292 2186
## [3935] 1312 2107 1025 1039 981 1358 1055 1193 1297 1000 1056 1146 819 2121
## [3949] 1540 1472 1449 1181 1065 1388 1367 1350 1373 2777 1131 1788 836 1530
## [3963] 1222 2895 1354 1172 976 1283 1401 1090 1042 1270 1164 1205 1372 1082
## [3977] 1233 1749 1406 1455 1021 1203 1648 2232 1242 999 1233 1257 1220 1004
## [3991] 1356 1462 1026 1038 1384 1023 1096 3485 1160 1109 1045 1497 1405 3148
## [4005] 1074 1251 1098 1228 1625 1277 1997 1173 855 2884 2146 1333 1381 1158
## [4019] 678 2181 1207 1356 3135 2846 1998 1983 973 1902 1308 1303 1447 878
## [4033] 1119 1480 1235 1208 1152 1342 1307 1201 2218 666 1145 974 1422 1218
## [4047] 1538 1316 1599 1551 1219 1372 1352 1368 1192 1427 1313 3206 1145 1276
## [4061] 1444 1023 1374 1614 941 3826 1330 1019 1420 1444 2104 2164 1066 1054
## [4075] 1239 2131 1445 1369 1156 1244 1170 1683 1318 1561 2262 1406 1147 1156
## [4089] 1353 1180 2685 1935 1314 3149 1233 1549 1436 1310 1039 1918 1357 1163
## [4103] 1685 1072 1926 1008 1001 1412 1163 1367 1424 1753 2495 1731 1071 1059
## [4117] 1150 1263 1118 1289 1398 1044 1823 2331 1225 2391 3414 973 1578 1268
## [4131] 1934 714 1456 1210 1339 1084 1332 989 1003 1442 1808 1441 1341 1113
## [4145] 1539 1027 1347 2454 1323 1667 1088 1173 1264 1330 1196 1488 987 1745
## [4159] 1194 1006 1077 1738 1031 1488 1137 1190 993 1164 1099 1948 1212 1359
## [4173] 1025 2136 1707 1197 819 1266 2347 1333 1525 1429 2678 865 2167 1454
## [4187] 1194 717 1368 1104 1134 1451 1200 1194 1261 1374 1471 1170 2708 757
## [4201] 1096 1227 864 1464 1416 782 1710 826 3299 1626 1305 1220 1062 1547
## [4215] 1258 1213 1250 1179 1113 945 2389 1177 1382 1062 1014 1247 1575 1104
## [4229] 1364 924 699 1667 1321 1268 1183 1364 1218 1544 1173 2310 1887 1174
## [4243] 1556 1193 568 1503 1212 1240 1713 1381 1576 1131 1259 1129 1198 1234
## [4257] 2319 1278 1013 1291 1646 965 1621 1178 1186 998 1178 2007 1320 1372
## [4271] 2498 1209 1558 1462 1191 1013 1362 1093 1533 2065 1073 1241 901 842
## [4285] 1454 1635 1081 979 961 1227 1006 1315 2875 1102 1347 1406 1155 1174
## [4299] 1458 1531 1373 1167 1347 1166 1685 910 1164 1197 1765 1182 2063 1530
## [4313] 2486 1117 1336 1430 1316 1337 1263 1491 2371 1866 1199 1129 1126 1003
## [4327] 1177 2184 1087 1413 830 1445 1918 1401 1858 1140 1210 1243 1565 1494
## [4341] 1237 1870 902 901 1245 1312 1045 1427 1456 927 964 1168 934 3165
## [4355] 2426 1309 1420 2649 1212 1112 991 1099 1452 1592 1180 743 1428 1276
## [4369] 2713 1244 1705 1146 1649 1150 835 1461 1043 1521 1599 1069 1451 1895
## [4383] 1161 1185 2695 1354 3744 1412 1382 1350 2135 1455 1149 1226 753 1506
## [4397] 1442 1104 1557 1060 1007 1407 1358 1402 1115 1200 1744 1077 1308 1088
## [4411] 1211 1752 1115 1487 1688 1230 1387 995 1942 1273 1938 751 1489 1568
## [4425] 1019 1382 1069 1168 1185 1069 1197 1393 1679 1132 1163 1244 1879 872
## [4439] 3072 1542 1549 1618 933 1465 1401 992 1389 1706 608 1364 2613 2283
## [4453] 1350 1559 1787 2027 1260 1013 964 1017 868 2170 1348 2364 1227 1314
## [4467] 1767 1697 983 1739 1688 1503 1923 2383 1544 1267 1476 1439 898 1249
## [4481] 1861 961 1571 2427 1399 1623 1915 1564 2276 1236 730 1305 1555 1100
## [4495] 1200 917 2470 1482 1405 1079 1391 1723 1763 1279 2676 2015 635 1109
## [4509] 770 1064 1000 924 2107 2802 1044 1698 954 2706 1219 2153 1925 1505
## [4523] 1247 1086 905 985 988 1048 1699 1075 1077 1766 1218 1338 1016 792
## [4537] 2670 1188 1569 1100 1188 1271 1623 1337 773 995 1420 2545 1948 1015
## [4551] 1267 1713 1225 1609 1026 1009 1807 1974 1479 1311 2121 1214 885 1159
## [4565] 2059 1087 1289 802 3105 2607 1336 1142 2332 1213 966 1464 2147 2002
## [4579] 1234 1220 944 1095 1591 1373 984 824 2010 1485 1143 1491 2022 1403
## [4593] 1173 1016 2321 732 778 1474 1128 842 905 1239 1631 1215 1396 1191
## [4607] 797 1364 2680 1760 986 1741 1384 1796 1234 1507 1202 1973 1075 1354
## [4621] 1112 1299 1535 1097 1733 1432 1273 1273 901 1482 1706 1476 1335 828
## [4635] 2443 1414 1786 1752 2392 1021 1717 1098 791 889 1214 1306 1015 1523
## [4649] 1124 1190 1560 1214 1982 1561 992 1511 1534 1008 1688 1439 1658 1267
## [4663] 1078 1205 867 1227 904 1459 1493 1031 1113 1140 1573 1193 1253 1297
## [4677] 1134 1085 1162 2687 1182 1005 1217 1226 756 1370 988 1257 1917 1126
## [4691] 1381 1031 2726 2127 1169 1035 1010 2051 1582 1334 1149 1013 1705 1137
## [4705] 1351 1540 1072 1423 1525 1401 948 1635 1190 933 1432 1079 913 1048
## [4719] 1170 2790 1860 1444 1352 2143 1075 1686 1305 1378 1153 1795 1370 1886
## [4733] 1139 1325 930 1072 1718 3423 1399 969 1317 1628 1778 2437 1267 1998
## [4747] 2224 1206 1111 1220 1320 1408 825 1204 1721 924 1511 1219 1336 1393
## [4761] 1092 1274 1627 1541 1126 1852 1343 1445 797 1178 2299 943 1578 1277
## [4775] 1170 1338 1179 1380 1378 1167 1348 1467 791 1010 1407 1256 1452 1127
## [4789] 752 1372 1079 2228 2978 1229 2013 1361 1677 1769 1178 1540 1359 1374
## [4803] 1284 1158 2014 1185 1990 654 1444 1289 1458 1343 1196 1562 2120 1269
## [4817] 3816 2717 1116 1207 1934 1354 1017 971 1147 1148 1227 1026 1474 2802
## [4831] 1463 1068 970 1245 1330 1575 768 1554 909 879 2128 1306 1266 1486
## [4845] 1275 1569 1538 1270 2109 2073 1151 1475 1026 1890 1316 1875 1406 1371
## [4859] 1036 1338 692 2036 1531 1248 1692 2020 1672 1034 1662 1484 1438 1233
## [4873] 3140 1326 999 1503 1091 1628 1511 1466 1529 1098 989 1789 2452 1432
## [4887] 1229 1522 1308 1044 2180 628 1681 1483 1951 1338 961 1365 1178 1468
## [4901] 1217 1296 1230 1111 1599 1407 1039 1245 735 1236 1548 1292 1513 1892
## [4915] 1178 941 1040 1992 1286 1027 929 1433 1144 1445 977 1080 1524 788
## [4929] 1377 1577 2078 1205 2483 823 2860 1904 2925 1414 1326 1232 931 1406
## [4943] 2225 1482 1396 1152 1736 1147 1971 1583 1191 998 1523 1236 1245 2226
## [4957] 1454 1036 936 1050 1070 1090 1085 878 1248 1202 1205 1040 1885 1282
## [4971] 1131 1362 671 1346 711 1075 1560 2740 1323 1259 3827 2252 990 1660
## [4985] 1855 1180 1282 1194 1485 1065 1189 1317 1199 1213 1042 2405 1376 1148
## [4999] 1113 2156 990 2351 1135 1656 1173 1189 4800 1413 1236 1062 1653 2434
## [5013] 1025 1343 1058 1257 1132 2594 720 1473 1289 2000 1604 1347 1366 844
## [5027] 1295 1678 1141 1572 1214 1206 1228 1146 1223 1566 1027 1437 2674 2503
## [5041] 1348 1123 1269 767 1270 1321 1046 1262 1358 1581 839 2177 968 1257
## [5055] 1030 1135 1129 1153 1293 1385 950 1384 1064 1606 1009 1153 1240 979
## [5069] 1797 1057 1147 1680 1388 1083 1682 1124 1693 2230 2015 1234 1141 3238
## [5083] 1298 1059 1596 1343 1630 1286 1214 1172 1323 2005 1259 1529 1412 1386
## [5097] 1170 970 1124 1332 1404 1212 2276 1528 960 2364 1275 1231 1294 1197
## [5111] 1536 1689 1537 1109 1047 1775 1141 1337 1521 1149 1056 1071 1186 1932
## [5125] 1095 1253 1377 1514 1202 1111 1558 1191 1463 1928 1774 1338 1748 988
## [5139] 1896 1749 2455 2123 3717 1033 1220 1062 1623 1340 998 1278 997 1023
## [5153] 1046 1101 1113 1053 1090 1434 780 1426 1020 718 1173 927 1178 982
## [5167] 1165 1084 1207 1681 2459 1504 1228 1271 1502 1161 963 1586 1296 1579
## [5181] 1356 809 2283 1777 980 1116 2907 1228 877 1260 909 2214 1191 1289
## [5195] 1156 1332 1133 935 1138 1177 1447 1198 1412 1343 1421 1132 2981 1439
## [5209] 1239 1212 1036 1516 1569 858 849 1198 1318 1368 1851 1605 1610 1425
## [5223] 1693 1163 1623 2525 698 1219 1631 1418 1367 1643 1265 1166 1142 1379
## [5237] 953 1684 1865 1545 1540 1178 796 1571 1238 1697 2233 681 3004 1423
## [5251] 2137 1692 1356 1325 1064 1341 1016 1401 1051 896 1248 1167 1293 1038
## [5265] 2087 989 1312 1203 1542 2214 889 2343 1093 824 974 1736 1459 1270
## [5279] 1793 1828 2904 1246 1074 1612 1855 1146 1266 1445 1381 1346 1187 1563
## [5293] 3135 1648 1139 845 1006 1316 1185 1087 1399 887 1313 1351 1698 1200
## [5307] 1169 1422 1293 1337 1350 1414 2353 2071 1101 2073 2514 1065 1870 2995
## [5321] 1125 1194 1170 1190 1204 1382 1904 925 1377 1379 1419 1305 1156 847
## [5335] 1390 1036 1012 1531 1382 1193 809 1198 1176 1946 1024 1291 1001 1271
## [5349] 1789 1429 1255 1533 1510 2934 2032 1463 1393 728 1011 1513 2279 1070
## [5363] 1014 1389 1055 1771 2136 1214 1330 2113 1065 969 1710 1263 1208 1319
## [5377] 2043 2212 1069 1379 906 1456 1760 1206 1122 1098 1468 843 1151 1055
## [5391] 1331 1310 1816 1253 1109 1122 1477 1184 1434 1258 964 2101 1185 1195
## [5405] 1093 2381 934 1750 2674 1395 1148 1502 1648 1234 1499 1073 1597 1152
## [5419] 1442 934 1071 1350 1589 1483 1155 1344 1187 1631 1035 1381 3193 814
## [5433] 1259 1635 1636 1173 1023 1404 1428 752 904 1237 895 1216 1095 1274
## [5447] 1372 1208 1283 1033 1315 1555 2744 1107 1133 1187 1625 1264 1393 1259
## [5461] 1478 1300 1226 1108 1286 1192 1080 1139 1040 1400 1198 1203 1128 870
## [5475] 788 2712 1584 2577 2571 1165 1094 953 1711 1372 2168 1611 964 1013
## [5489] 1041 1393 1491 980 1555 1271 1376 1308 987 2109 1937 1021 950 1115
## [5503] 756 1357 1411 1488 1127 881 1196 983 944 1360 1045 1169 773 3107
## [5517] 1543 723 1224 1450 1381 1127 1096 1145 1295 1096 1155 996 1326 1432
## [5531] 1149 1456 1163 1550 1253 1076 1559 861 1748 1268 1245 1107 972 1279
## [5545] 1221 1177 2229 1295 834 1255 2271 2243 1092 1104 1633 1964 3245 1393
## [5559] 1692 2480 2185 1196 1955 996 1712 1402 1143 1144 1324 1101 833 1172
## [5573] 1254 1094 849 1786 1591 1358 1382 2272 1050 1068 1103 1419 2089 1059
## [5587] 973 1172 1276 1388 3343 828 2506 1094 1423 1003 965 1650 1324 2675
## [5601] 1167 1512 2041 1454 946 1948 1968 1056 1389 1108 1234 2316 1852 918
## [5615] 1791 1233 981 1175 2172 874 884 932 961 933 884 1367 802 2247
## [5629] 1049 1068 859 961 848 1534 665 1274 1571 1608 1271 1001 920 3987
## [5643] 1020 888 1129 1518 1200 1496 1223 1489 1361 1313 1950 1574 3165 1311
## [5657] 1314 1216 1867 1486 1277 995 790 1067 1258 1569 1466 1563 1116 1388
## [5671] 2171 1028 1150 1122 1437 1692 1344 1481 1250 1410 691 2373 3091 1276
## [5685] 733 1050 1161 1239 1971 1514 1241 1072 1445 1182 980 1441 963 1773
## [5699] 2763 1154 1822 1088 1349 1374 1105 1476 1294 1018 1549 1656 1510 2026
## [5713] 1846 706 1015 1694 797 1194 944 1278 865 1154 1840 908 1100 1354
## [5727] 999 1735 1275 960 1372 1036 1125 1845 2356 1283 1281 1034 1264 1494
## [5741] 632 1362 1274 1240 1548 919 1393 1137 1051 1217 1488 1237 898 1153
## [5755] 1031 990 1111 1076 1283 1838 1676 1769 1502 1463 1471 1049 1219 1138
## [5769] 1214 1241 1377 740 1343 3268 958 1111 922 1143 1493 1443 913 1057
## [5783] 1410 995 1374 1017 1065 1012 2065 953 1509 1101 842 2186 792 2378
## [5797] 1501 1698 1031 1370 1212 1278 1949 1768 1037 2639 1392 837 1583 863
## [5811] 721 913 1485 1157 1103 1314 1580 1349 1258 1266 1063 1104 1505 1193
## [5825] 1463 1382 1022 1558 1367 1019 2359 1275 1603 2253 1164 1128 2275 852
## [5839] 1201 1137 1191 1401 1042 1489 2012 1301 1254 2583 742 1086 1447 1306
## [5853] 1387 975 1020 1006 751 945 1356 1364 1813 1575 1313 2338 818 2233
## [5867] 1343 2264 1441 1383 1004 1340 1297 1033 1611 2098 1104 1221 963 970
## [5881] 876 994 1181 758 1147 993 1476 953 1778 1122 1035 1481 2898 3064
## [5895] 1086 1797 1536 3241 1171 927 1223 1157 1424 1337 1303 1329 1382 1586
## [5909] 1152 1026 1091 1247 2299 967 1249 1488 1369 1129 1308 1198 1872 1535
## [5923] 1209 1448 1308 1207 1342 1636 858 2313 1336 1394 875 1399 1298 1172
## [5937] 1206 2155 850 1080 1612 1520 1013 1663 882 923 1430 1273 1101 1459
## [5951] 969 1159 1594 1358 1505 1536 773 1485 1243 1214 1395 962 892 1214
## [5965] 1272 1348 945 1157 2706 962 1698 986 1295 1190 629 1081 1252 913
## [5979] 824 1327 1119 1250 1079 1542 885 1177 899 1693 1420 1130 1813 889
## [5993] 1267 1482 1262 990 965 1584 1370 1261 1267 1518 1381 1057 1431 1175
## [6007] 1237 1139 1636 1014 1299 1011 1763 1479 1480 1260 854 1121 1306 1606
## [6021] 1511 1577 978 1194 1056 1208 1317 1092 1518 1669 1237 1631 1090 1147
## [6035] 906 2054 1537 2389 2046 1139 1245 3041 2608 1132 930 1761 1253 1545
## [6049] 1266 1199 993 1465 1503 2002 1191 1374 1260 1367 797 1392 1305 1464
## [6063] 926 1509 1216 1406 1227 1392 865 1156 1324 1047 1485 1339 1131 1375
## [6077] 1449 1206 1235 1407 1048 1004 1912 1531 1145 1255 1076 1377 1088 1415
## [6091] 1252 1050 1603 1996 948 652 1073 1224 1180 842 1130 1009 1153 1179
## [6105] 1426 1026 1781 1204 1067 1260 1228 1241 1135 1165 1845 1115 1212 967
## [6119] 1015 1340 1023 1370 1793 811 1364 970 860 1602 1030 1455 1012 1151
## [6133] 1296 1474 1276 1118 1088 1396 1205 2419 1097 1152 2134 1035 1110 2190
## [6147] 1018 1299 1169 1202 2894 1726 739 890 1219 1365 1317 1305 1227 1226
## [6161] 1668 1058 1381 841 1114 969 1047 987 950 1098 956 1037 2182 1169
## [6175] 1302 2116 1292 1226 1329 1484 1095 1836 1375 1016 1350 1922 967 2784
## [6189] 1425 1275 1359 2385 1379 1141 1165 1244 793 1280 932 1243 1070 1377
## [6203] 907 1179 979 1228 1518 1643 782 1675 1465 1034 1308 1480 1507 966
## [6217] 1272 1658 1699 1196 1899 1324 976 1476 1477 1127 1288 1409 2178 1644
## [6231] 976 1214 1126 1320 976 1770 1515 1747 1123 639 1331 1262 1070 1093
## [6245] 1181 2689 723 1171 928 1034 2166 1294 2037 1488 1332 2953 1555 2049
## [6259] 1443 1292 1122 997 1214 922 1210 1408 1190 961 815 1390 1607 1034
## [6273] 1887 679 1420 3726 1212 1031 1474 1252 1062 1409 1113 1549 923 1202
## [6287] 1226 1600 1156 1101 1482 1864 1093 2601 1114 858 1177 2112 1464 1082
## [6301] 1155 1480 1503 1698 1130 1376 1307 1737 1451 1046 1902 1332 1528 1033
## [6315] 1007 1088 1301 1288 1098 1251 1220 2951 2119 1342 1160 1488 1138 2603
## [6329] 1129 1199 1305 1806 1340 1559 1269 1196 893 1392 1177 1439 864 938
## [6343] 1833 1025 1336 2715 1174 1226 994 1199 1177 3965 1065 1306 2240 1490
## [6357] 932 1530 1243 1139 1309 2492 1595 1277 1178 1127 1682 2148 1123 1394
## [6371] 1182 968 1363 900 1518 1357 1237 1265 1036 1976 1411 996 1986 1098
## [6385] 1420 1168 1600 1232 1144 1342 4157 1698 1403 1287 1548 1242 2443 2880
## [6399] 1572 1330 1210 1141 1323 1534 904 2205 1426 2606 2058 1157 1127 817
## [6413] 1510 1033 1379 752 2191 979 1078 3178 1354 1382 1483 1542 2723 1149
## [6427] 1769 1527 723 1180 1143 1008 1647 1276 2995 1377 1562 3184 2149 2338
## [6441] 1232 955 2088 1103 1282 1500 656 1495 1516 1169 1091 1236 836 1250
## [6455] 1118 1308 1294 2312 1314 2053 898 1586 1306 1024 1399 2340 1364 1505
## [6469] 1080 811 1432 1407 1306 1846 1972 1840 1569 1102 1223 1477 1591 1256
## [6483] 2643 1897 709 1101 1933 1074 1946 1161 1269 1152 1315 1160 1394 1573
## [6497] 1591 1769 1274 979 1356 1509 598 1584 1215 1108 1551 1449 1289 1180
## [6511] 1169 1212 951 1344 2393 1268 1498 1070 1045 1984 1633 1473 1320 1497
## [6525] 961 1270 1057 1085 1007 1690 1559 1412 1560 1323 1135 1666 938 1199
## [6539] 1575 1305 1206 1857 1259 1697 1367 2104 880 1155 1271 768 1458 1237
## [6553] 825 1435 1429 1063 1237 1313 1296 1802 1667 1301 1968 1131 1448 792
## [6567] 1286 1048 1232 1051 1011 1050 1337 1839 2620 1988 1896 1242 1168 1159
## [6581] 1659 1006 1533 1004 764 1486 2008 1422 2271 2185 1787 1454 2269 1277
## [6595] 1406 1804 1109 1054 1307 1303 1057 1672 1109 1081 1829 1027 1492 1283
## [6609] 1940 1017 1863 944 1401 1378 1499 1103 1432 1968 1577 1303 2965 3084
## [6623] 1143 1007 2258 1111 1749 1312 1529 945 1341 1280 1542 1348 1458 1161
## [6637] 2949 1049 1531 1135 2024 2205 2094 3084 1354 1097 1568 1322 1648 780
## [6651] 1291 1338 1026 1086 1116 953 1165 1286 1109 2162 987 1717 1183 1508
## [6665] 1540 1168 1011 1286 1968 1160 1179 1116 1175 1606 1366 1458 1060 1243
## [6679] 1789 2017 2123 1387 1585 1173 1321 1467 1073 1201 1154 1236 1791 1135
## [6693] 2527 1742 812 2152 1236 1382 1428 1016 1182 1245 1116 1048 1332 1689
## [6707] 1189 1668 1594 1340 1329 1173 981 1388 1374 1638 1302 1162 1227 1851
## [6721] 2317 954 1424 1250 1469 1142 838 2591 1404 3069 2222 1579 1018 833
## [6735] 947 1511 1673 927 1062 1196 1073 1298 1087 2399 2508 1378 2218 2017
## [6749] 1593 1201 1335 1220 1311 1388 979 3002 1479 1985 1339 3496 1782 1066
## [6763] 1174 2516 1423 1154 1208 956 1143 1207 1168 1215 786 594 1100 1218
## [6777] 1421 1389 828 1579 1377 1140 1812 1452 1135 2062 1175 1737 1128 1684
## [6791] 1813 994 965 1118 3456 1565 902 1340 2241 2957 1212 2152 650 1216
## [6805] 1318 2333 1163 1563 1774 1334 931 1828 1491 1499 1213 1239 1133 1051
## [6819] 1430 1011 1230 964 1198 1112 1310 1261 1367 917 2331 1098 1038 700
## [6833] 1376 2660 1195 1296 1130 2651 1692 1683 1171 986 1198 1260 1292 1510
## [6847] 1530 1031 1443 1528 1401 1443 1664 1178 1259 3436 1596 1088 1378 1351
## [6861] 1424 2492 1415 1511 1276 1199 1166 2003 1440 1635 1355 1258 3695 1358
## [6875] 1477 1209 1330 2279 1286 1309 2208 1411 1433 1805 1001 1685 1190 1227
## [6889] 1298 1738 1622 1674 1363 1068 1802 1303 911 2278 1432 1129 2283 1388
## [6903] 1719 1442 1135 1448 1422 1628 2330 1375 1831 1323 1532 1138 1565 1099
## [6917] 1066 1033 1380 1458 1197 1455 1944 1233 2043 1181 1866 749 1595 1057
## [6931] 1014 1410 1786 1457 1287 1244 1328 1073 2410 1034 2098 2183 1171 1322
## [6945] 1413 1056 1434 1441 2791 2674 1340 956 1473 798 1017 1447 1509 887
## [6959] 1096 931 1465 1078 1312 1124 1330 927 1145 3329 982 1176 1561 1334
## [6973] 1613 1257 1580 1415 2580 1262 1014 2561 1226 1303 1358 1221 1741 687
## [6987] 743 1375 961 1382 1415 2592 1388 932 978 1195 1225 1590 1186 1130
## [7001] 1328 2128 1489 1208 1345 2040 1110 1495 1157 720 971 1507 1138 1416
## [7015] 1183 1146 1518 3224 1261 1157 768 1366 1419 1047 1410 1847 1006 2015
## [7029] 966 1258 1526 1202 984 1262 1421 1118 1081 1490 801 1683 1240 1565
## [7043] 1226 1515 1067 1149 1467 1567 1220 1005 933 1036 3059 1150 1460 764
## [7057] 1303 1838 1249 1145 1702 825 1592 1413 590 2301 1179 1046 1304 2615
## [7071] 1344 1032 961 645 1726 1134 1468 1446 1626 1825 1157 828 1402 1627
## [7085] 1428 1038 1163 1169 1718 1349 1517 1211 1093 1719 1845 1908 871 3176
## [7099] 1746 1240 2726 1273 1055 1372 1564 1230 1078 1379 2381 2686 1780 770
## [7113] 1022 1507 1271 1234 1422 1208 887 1347 1184 3103 1319 1137 1271 916
## [7127] 1205 1402 1714 1607 1465 648 1185 1895 1148 1214 1196 2024 1528 1196
## [7141] 1263 1372 1293 977 1286 1356 1279 1205 1414 1461 1365 1616 1247 1471
## [7155] 1125 1528 1150 1572 1072 1464 1003 1310 1323 1435 1252 1310 1167 1275
## [7169] 1019 1296 701 1991 1545 1172 1721 1215 1265 1505 1361 1029 1206 1420
## [7183] 2231 1313 1878 1315 1645 2667 1199 888 1527 1557 1411 1469 1326 1196
## [7197] 1284 1394 1180 960 842 1122 1993 985 968 1281 1413 1196 1459 1663
## [7211] 2145 1281 1552 1538 949 1744 979 1646 1000 1509 1165 1370 1808 1145
## [7225] 1311 1218 961 1297 1441 1322 1008 1142 1812 1384 1153 1106 2097 830
## [7239] 2339 1225 2346 2366 1452 805 1052 1322 1282 1606 1611 1291 1474 1044
## [7253] 1466 1292 1388 968 1114 1912 1485 908 1362 901 1477 946 1185 1055
## [7267] 794 1048 1793 1324 1180 1151 2101 1761 1053 1150 1319 1671 1729 1332
## [7281] 2389 1207 1266 666 1694 1421 1176 2215 1198 1419 1121 1225 1631 1257
## [7295] 1923 1434 2300 935 1456 1325 1274 1253 1072 1418 1173 1682 1320 1539
## [7309] 1466 1052 1383 1879 899 1433 1868 1453 1291 1469 1579 1084 913 1544
## [7323] 1252 1524 1284 2301 1444 2720 1177 998 1864 1119 1179 2157 990 1063
## [7337] 953 1696 1896 1610 964 3388 1391 1161 2187 1562 1539 983 1602 1226
## [7351] 1111 1576 1154 2026 1250 1609 751 1264 1269 1474 1110 1478 1257 1257
## [7365] 2453 1518 2207 1247 1708 914 1239 882 1139 856 1512 1564 1153 1770
## [7379] 1400 1407 1264 1714 1187 908 2335 1119 1827 2290 1503 1126 747 2898
## [7393] 1577 2707 1168 1363 1136 3281 1862 1870 1147 1032 2614 3080 1636 1276
## [7407] 1077 1727 1410 826 1503 2079 2403 2400 1522 977 1480 1138 1079 870
## [7421] 1244 1218 690 1511 2363 2009 2035 1876 1813 1430 1475 1966 1151 981
## [7435] 1039 971 1477 1456 1263 2302 1503 2338 1245 1344 1129 2177 1356 883
## [7449] 1620 1497 1388 1103 1610 1486 981 1415 1375 1376 1157 1268 2032 987
## [7463] 1070 1229 1218 2531 1094 1067 884 1018 1294 2103 2604 923 1329 3324
## [7477] 1306 1440 1220 1157 1319 1064 809 1390 1399 1132 1535 1167 1030 2105
## [7491] 778 1374 1808 1611 1425 995 1037 1009 899 1219 2322 1387 1386 1216
## [7505] 1514 1412 1461 1142 2750 1667 1197 1326 1249 1876 1421 1507 1708 1185
## [7519] 1202 997 1351 1303 1099 1335 1986 1046 1332 1134 1174 1004 1241 894
## [7533] 1662 2148 1286 1635 1689 1235 848 1010 1266 1223 1669 2249 1558 2389
## [7547] 1007 797 1241 966 1739 1098 1497 926 1091 814 1303 1359 748 1146
## [7561] 2647 2233 1171 1032 701 1335 909 1553 1634 1339 1278 1242 1086 847
## [7575] 771 2529 2213 2560 1430 1969 1273 1299 989 1123 1084 2206 1979 2041
## [7589] 1038 742 1073 1100 1189 1557 1086 1475 1333 1126 1045 1072 1168 1782
## [7603] 2239 3903 1489 1781 1338 1855 1765 1394 1138 1624 1261 1540 1492 2156
## [7617] 1645 2262 1713 998 1614 1310 1472 1974 1356 1389 1770 1301 978 1478
## [7631] 2184 2072 698 1139 1171 1256 1518 1374 1491 1612 720 1470 1233 1822
## [7645] 3152 1142 1527 1436 872 1419 963 2196 3368 1578 1164 839 1305 2128
## [7659] 1863 2994 1461 1327 1349 1155 1348 949 855 1327 1049 2634 1386 1289
## [7673] 1522 1706 2017 1231 2149 1034 1184 1110 1022 3690 996 1435 1513 1608
## [7687] 1349 2625 898 1169 1497 1157 1387 1121 2467 1205 1378 1157 1192 1641
## [7701] 1341 1331 1335 1263 1140 1260 1470 1170 999 1695 1334 1237 1911 1837
## [7715] 1186 1469 3659 1001 1095 1782 1159 878 3021 1135 1062 551 1140 1299
## [7729] 1630 1498 995 2084 941 2004 1371 1476 1585 1351 1054 1180 1130 1078
## [7743] 1063 1218 2002 1563 814 1019 1859 2373 994 1318 1010 1441 1210 1365
## [7757] 903 1272 1526 1829 3729 1639 1580 1216 2049 1114 901 1224 816 1368
## [7771] 1377 1597 2137 2065 1580 1564 1500 1096 982 1001 886 1232 922 656
## [7785] 1201 1518 1110 974 1490 1079 759 710 1038 1144 1254 866 1435 1489
## [7799] 1053 1068 1459 1244 1477 1096 1346 1401 1201 1128 1000 891 1960 1336
## [7813] 1011 1010 2055 1235 1600 2169 1205 1129 1579 1072 1164 887 1308 1270
## [7827] 1265 1388 841 765 1629 1899 1327 1314 1172 852 1105 1195 1102 2223
## [7841] 1133 731 2083 1503 1996 906 1449 2990 607 1397 1472 1750 943 1040
## [7855] 1356 1119 1341 1542 1104 1104 1218 1393 1111 2364 1034 1252 1547 1069
## [7869] 1330 1268 1007 1047 3791 884 1676 989 1423 1117 1011 1522 885 1220
## [7883] 1000 1286 1167 1072 1270 1808 1231 1374 966 1327 1598 2290 1122 1542
## [7897] 1154 1190 1053 1745 1162 1067 961 1603 1041 1380 1681 1377 1042 1186
## [7911] 1171 1070 2332 1030 945 982 1465 1354 2584 1078 959 1309 1080 2154
## [7925] 1333 992 951 1180 1292 1422 1867 1080 1663 1394 1348 1058 1212 977
## [7939] 1442 1320 2357 1447 1175 2562 999 1603 1284 1271 1597 1038 811 1261
## [7953] 2314 1354 1120 2996 2077 1124 1202 1337 1161 1564 1041 1916 1388 648
## [7967] 1026 1615 2649 2018 1177 1030 1336 1915 1249 1114 781 943 1259 1000
## [7981] 1140 1409 1343 1806 2347 1112 1394 1285 1001 2107 3071 1131 900 1407
## [7995] 980 3453 1532 1243 2363 1005 988 922 1489 1095 1614 1578 2100 1330
## [8009] 1451 1610 850 1458 1122 1591 1386 1007 1247 1249 1936 1452 1638 951
## [8023] 949 1414 1134 2637 1292 695 1317 1137 1516 1700 1404 1196 1063 1565
## [8037] 1444 1143 1458 1060 1373 1421 1301 1358 1204 1443 1871 1542 1503 1172
## [8051] 1435 1254 751 1180 1092 1406 1352 1463 1310 1947 2984 850 2227 1264
## [8065] 1028 1123 1106 1160 1649 2128 1030 1625 2281 1091 1482 1233 886 1074
## [8079] 1106 1172 1441 1362 1076 833 1316 1079 1199 1119 1851 1717 1040 1229
## [8093] 1631 1070 1167 1289 1327 1488 1353 1053 1476 1476 1412 2381 1150 2167
## [8107] 1658 2881 876 881 2288 1414 1065 2094 1126 1084 1229 1296 1013 1132
## [8121] 1117 2087 1596 1375 1043 1894 1075 1384 1005 2024 1115 1388 1210 1355
## [8135] 2129 1475 959 688 1623 895 1204 1078 878 1275 1112 1541 896 1376
## [8149] 1663 1069 1381 1085 1446 1113 1551 1214 1408 1346 1136 1688 1681 855
## [8163] 1206 1235 927 1907 1072 2604 1187 1424 1260 1145 1148 1036 2435 1265
## [8177] 1304 1729 1130 1284 978 1200 1017 2474 1444 866 1496 1487 793 1499
## [8191] 1185 1420 1428 910 1002 1160 1418 1037 1251 1204 1050 1082 1377 1156
## [8205] 1322 1394 1215 1288 2836 1180 1215 1635 1201 1640 1963 1452 764 943
## [8219] 963 2116 1011 1387 1296 3345 1316 1596 1292 1101 1255 1299 1641 896
## [8233] 1104 978 1004 1206 1488 1268 1238 1282 2287 1043 1310 1441 1190 1864
## [8247] 2508 1543 1239 2335 1368 1567 784 1248 1525 835 1218 2779 1460 1469
## [8261] 976 1748 994 1590 1210 1119 978 875 1405 1255 1335 1450 2073 1112
## [8275] 1174 664 1060 796 1220 2138 1518 1902 1338 1446 777 1219 1467 1065
## [8289] 1515 899 1883 905 1325 1463 1194 1193 1429 1401 1215 1285 1491 1965
## [8303] 1204 1968 1188 1043 1042 1079 2976 1945 2110 2283 1597 971 1201 1419
## [8317] 1393 1296 1389 1157 1537 1835 1295 1867 1268 1200 1477 1290 1327 1126
## [8331] 1566 935 1195 1646 1084 1228 1370 1052 984 2047 1228 1215 1126 1257
## [8345] 1021 2302 957 1376 1436 1284 989 1502 1416 1215 1735 1250 946 2611
## [8359] 2007 1178 1127 858 1240 1158 801 2552 1403 1095 720 1118 2208 1589
## [8373] 1218 1062 1268 1436 1176 1379 1802 1317 1259
By contrast, if most of the elements are nonzero, then the matrix is considered dense. The number of zero-valued elements divided by the total number of elements (\(\textrm{m}\times \textrm{n}\)) is the measure called sparsity of the matrix. Using those definitions, a matrix will be sparse when its sparsity is greater than 0.5. It’s part of Matrix library.
exprs(my_cds)[1:100, 1:5]
## 100 x 5 sparse Matrix of class "dgTMatrix"
## AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1
## ENSG00000243485 . . .
## ENSG00000237613 . . .
## ENSG00000186092 . . .
## ENSG00000238009 . . .
## ENSG00000239945 . . .
## ENSG00000239906 . . .
## ENSG00000241599 . . .
## ENSG00000279928 . . .
## ENSG00000279457 . . .
## ENSG00000228463 . . .
## ENSG00000236743 . . .
## ENSG00000236601 . . .
## ENSG00000237094 . . .
## ENSG00000278566 . . .
## ENSG00000230021 . . .
## ENSG00000235146 . . .
## ENSG00000273547 . . .
## ENSG00000229905 . . .
## ENSG00000237491 . . .
## ENSG00000177757 . . .
## ENSG00000225880 . . .
## ENSG00000230368 . . .
## ENSG00000272438 . . .
## ENSG00000230699 . . .
## ENSG00000241180 . . .
## ENSG00000223764 . . .
## ENSG00000187634 . . .
## ENSG00000188976 . . .
## ENSG00000187961 . . .
## ENSG00000187583 . . .
## ENSG00000187642 . . .
## ENSG00000272512 . . .
## ENSG00000188290 . . .
## ENSG00000187608 1 . 1
## ENSG00000224969 . . .
## ENSG00000188157 . . .
## ENSG00000273443 . . .
## ENSG00000237330 . . .
## ENSG00000131591 . . .
## ENSG00000223823 . . .
## ENSG00000272141 . . .
## ENSG00000205231 . . .
## ENSG00000162571 . . .
## ENSG00000186891 . . .
## ENSG00000186827 . . 4
## ENSG00000078808 1 . .
## ENSG00000176022 . . .
## ENSG00000184163 . . .
## ENSG00000260179 . . .
## ENSG00000160087 . . .
## ENSG00000230415 . . .
## ENSG00000162572 . . .
## ENSG00000131584 . . .
## ENSG00000169972 . . .
## ENSG00000127054 . . .
## ENSG00000224051 . . .
## ENSG00000169962 . . .
## ENSG00000107404 . . .
## ENSG00000162576 . . .
## ENSG00000175756 1 . 2
## ENSG00000221978 . . .
## ENSG00000242485 . . 1
## ENSG00000272455 . . .
## ENSG00000235098 . . .
## ENSG00000225905 . . .
## ENSG00000205116 . . .
## ENSG00000225285 . . .
## ENSG00000179403 . . .
## ENSG00000215915 . . .
## ENSG00000160072 . . .
## ENSG00000197785 . . .
## ENSG00000205090 . . .
## ENSG00000160075 . . 1
## ENSG00000215014 . . .
## ENSG00000279244 . . .
## ENSG00000228594 . . .
## ENSG00000272106 . . .
## ENSG00000197530 . . .
## ENSG00000189409 . . .
## ENSG00000248333 . . .
## ENSG00000272004 . . .
## ENSG00000189339 . . .
## ENSG00000269737 . . .
## ENSG00000008128 . . .
## ENSG00000215790 . . .
## ENSG00000008130 . . 1
## ENSG00000078369 . . .
## ENSG00000231050 . . .
## ENSG00000169885 . . .
## ENSG00000178821 . . .
## ENSG00000142609 . . .
## ENSG00000233542 . . .
## ENSG00000187730 . . .
## ENSG00000226969 . . .
## ENSG00000067606 . . .
## ENSG00000271806 . . .
## ENSG00000182873 . . .
## ENSG00000162585 . . .
## ENSG00000234396 . . .
## ENSG00000157933 . . 1
## AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1
## ENSG00000243485 . .
## ENSG00000237613 . .
## ENSG00000186092 . .
## ENSG00000238009 . .
## ENSG00000239945 . .
## ENSG00000239906 . .
## ENSG00000241599 . .
## ENSG00000279928 . .
## ENSG00000279457 . .
## ENSG00000228463 . .
## ENSG00000236743 . .
## ENSG00000236601 . .
## ENSG00000237094 . .
## ENSG00000278566 . .
## ENSG00000230021 . .
## ENSG00000235146 . .
## ENSG00000273547 . .
## ENSG00000229905 . .
## ENSG00000237491 . .
## ENSG00000177757 . .
## ENSG00000225880 . .
## ENSG00000230368 . .
## ENSG00000272438 . .
## ENSG00000230699 . .
## ENSG00000241180 . .
## ENSG00000223764 . .
## ENSG00000187634 . .
## ENSG00000188976 . .
## ENSG00000187961 . .
## ENSG00000187583 . .
## ENSG00000187642 . .
## ENSG00000272512 . .
## ENSG00000188290 . .
## ENSG00000187608 . .
## ENSG00000224969 . .
## ENSG00000188157 . .
## ENSG00000273443 . .
## ENSG00000237330 . .
## ENSG00000131591 . .
## ENSG00000223823 . .
## ENSG00000272141 . .
## ENSG00000205231 . .
## ENSG00000162571 . .
## ENSG00000186891 . .
## ENSG00000186827 . .
## ENSG00000078808 1 .
## ENSG00000176022 . .
## ENSG00000184163 . .
## ENSG00000260179 . .
## ENSG00000160087 1 .
## ENSG00000230415 . .
## ENSG00000162572 . .
## ENSG00000131584 . .
## ENSG00000169972 . .
## ENSG00000127054 . .
## ENSG00000224051 . .
## ENSG00000169962 . .
## ENSG00000107404 . .
## ENSG00000162576 . .
## ENSG00000175756 . 1
## ENSG00000221978 . .
## ENSG00000242485 . .
## ENSG00000272455 . .
## ENSG00000235098 . .
## ENSG00000225905 . .
## ENSG00000205116 . .
## ENSG00000225285 . .
## ENSG00000179403 . .
## ENSG00000215915 . .
## ENSG00000160072 . .
## ENSG00000197785 . .
## ENSG00000205090 . .
## ENSG00000160075 . 1
## ENSG00000215014 . .
## ENSG00000279244 . .
## ENSG00000228594 . .
## ENSG00000272106 . .
## ENSG00000197530 . .
## ENSG00000189409 . .
## ENSG00000248333 . .
## ENSG00000272004 . .
## ENSG00000189339 . .
## ENSG00000269737 . .
## ENSG00000008128 . .
## ENSG00000215790 . .
## ENSG00000008130 . .
## ENSG00000078369 . .
## ENSG00000231050 . .
## ENSG00000169885 . .
## ENSG00000178821 . .
## ENSG00000142609 . .
## ENSG00000233542 . .
## ENSG00000187730 . .
## ENSG00000226969 . .
## ENSG00000067606 . .
## ENSG00000271806 . .
## ENSG00000182873 . .
## ENSG00000162585 . .
## ENSG00000234396 . .
## ENSG00000157933 . .
head(Matrix::colSums(exprs(my_cds)))
## AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1 AAACCTGCACATTAGC-1
## 2394 1694 4520 2788
## AAACCTGCACTGTTAG-1 AAACCTGCATAGTAAG-1
## 4667 4440
So, colSum of how many genes contain a certain unique molecular tag is thus count of that UMI. You can put that back in my_cds as UMI value. And with x axis as num_genes_expressed and UMI (count) as y axis, you can draw the following plot.
ggplot(pData(my_cds), aes(num_genes_expressed, UMI)) + geom_point()
This plot has UMI count on the left (how many of unique degenerate tags like “AAACCTGAGCATCATC” are present in the expression data.)
So single-cell sequencing data is a high-dimensional information of 20000~30000 genes in several hundred thousands to millions of cells. It takes a lot of time to compute high-dimensional data and it’s also vulnerable to various noises. Therefore we combine about 30000 genes in the analysis and perform dimensionality reduction to about tens of genes. The Monocle package itself provides cell clustering analysis methods. There is a tendency for good results when using marker genes, but when such pre-obtained knowledge isn’t available we must perform clustering analysis without selecting for particular marker genes. We make use of average expression and variants of each gene throughout cells.
So, take genes with mean_expression over 0.1 that we clustered earlier. Use setOrderingFilter() function to designate which genes will be used for clustering.
plot_ordering_genes() function gives us average expression level compared to the empirically obtained empirical dispersion, and it marks gene set to be used in cluster analysis with black dot for emphasis.
disp_table <- dispersionTable(my_cds)
head(disp_table)
## gene_id mean_expression dispersion_fit dispersion_empirical
## 1 ENSG00000238009 0.0026983425 52.863362 0.000000000
## 2 ENSG00000279457 0.1548382003 1.068659 0.356959667
## 3 ENSG00000228463 0.0543320140 2.767983 0.005359107
## 4 ENSG00000237094 0.0022958750 62.104025 30.358663597
## 5 ENSG00000230021 0.0002830376 502.693229 0.000000000
## 6 ENSG00000237491 0.0245526857 5.943231 0.000000000
table(disp_table$mean_expression >= 0.1)
##
## FALSE TRUE
## 15422 4012
So what’s a table?
class(table(disp_table$mean_expression >= 0.1))
## [1] "table"
Table class object can be created by converting a factor into table of occurrence of each factor in the factor object.
https://www.cyclismo.org/tutorial/R/types.html#tables
Take only the subset genes of disp_table where the mean_expression is greater than 0.1.
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
Then, with setOrderingFilter, you designate which genes will be used for clustering. So you take gene_ids we obtained from subsetting for mean_expression bigger than 0.1 and select for those genes in my_cds (my cell data) and assign that data back in my_cds.
my_cds <- setOrderingFilter(my_cds, unsup_clustering_genes$gene_id)
Now, plot it out with plot_ordering_genes(), where it plots empirical dispersion gained from empirical knowledge versus mean_expression.
plot_ordering_genes(my_cds)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
clusterCells() function performs cluster analysis of CellDataSet objects. (Arguments are arbitrarily selected to 10 dimension, 15 clusters.) The cluster analysis results are stored in phenoData table.
Now we reduce dimensionality through a process called reduceDimension.
So what’s tSNE?
t-SNE = t-Stochastic Neighbor Embedding. It reduces dimensions so that two vectors that are similar in higher dimensions are also closely related, distance-wise, in two-dimensional space as well so that we can also visualize how similarly the points are clustered together. This is a means of visualizing the similarity in higher dimensions in the lower dimensions. Refer to this blog article https://lovit.github.io/nlp/representation/2018/09/28/tsne/. Also refer to this paper where t-SNE was first proposed. This blog article is also a good mathematical take on explaining the manifold learning methods.
Actually there are many ways (algorithms) to reduce dimensionality: MDS (multi-dimensional scaling), ISOMAP, Locally Linear Embedding (LLE), t-SNE (and perhaps PCA).
These algorithms put emphasis on conservation of different information.
Now, LLE accounts for closest k number of points, but doesn’t care at all about other points. The whole structural information COULD in fact be conserved, but there’s no guarantee. In order to solve this we consider points that are closer to the original space, but also must consider points that are further away. Following excerpt from the t-SNE paper.
“In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map.”
Most of the techniques previously mentioned are not able to retain both the local and global structure of the data. But t-SNE can. It has a parameter called perplexity, rather than k like in LLE, but does not react too sensitively to the parameter. There’s a certain range of well-working value obtained from empirical knowledge. Such robustness of t-SNE in retaining the data structure and reflecting/projecting that onto a 2D graphical map is why we use t-SNE in dimensionality reduction.
How to embed high dimensional vectors to low dimensional space
Original data similarity \(p_{ij}\) and embedding space’s data similarity \(q_{ij}\) are defined. In order to get \(p_{ij}\), you gotta calculate the similarity between two points \(x_i\) to \(x_j\), which is designated as \(p_{j|i}\). This similarity is in the form of probability. In order to define this similarity value, first compute Euclidean distance \(|x_i-x_j|\) from the standard point \(x_i\) to all the other points. And based off of this euclidean distance, express the distance between \(x_i\) and \(x_j\) in terms of probability.
\(\sigma_i\) shows the standard deviation. Divide distance between \(a_i\) and \(a_j\) and take the negative exponential of that.
\(exp(\frac{-|x_i-x_j|^2}{2\sigma_i^2})\)
This distribution is non-negative at all values, and it becomes bigger and bigger as the two points are closer together.
If you divide this value by the sum of that expression over all points, whereas \(\sum_{k=1} exp(\frac{-|x_i-x_k|^2}{2\sigma_i^2})\), then you get a probability-looking stuff.
\[p_{j|i}=\frac{exp(\frac{-|x_i-x_j|^2}{2\sigma_i^2})}{\sum_{k \ne i} exp(\frac{-|x_i-x_k|^2}{2\sigma_i^2})}\]
This \(p_{j|i}\) performs role as a stochastic modeling. Stochastic modeling means a model where a point moves to another point at each time according to a certain probability distribution. So we define here the probability of point \(x_i\) moving to \(x_j\) at the next time point as \(p_{j|i}\).
Note that \(p_{j|i}\) is not necessarily equal to \(p_{i|j}\). Therefore, for symmetry, we define the arithmetic average as the similarity between two points. And then because there are total \(n\) points, we can make the entire sum of similarities of all points equal to 1.
\[p_{ij} = \frac{p_{i|j}+p_{j|i}}{2n}\]
You could also define \(p_{ij}\) like below using identical \(\sigma\) for all \(x_i\)s. It could be problematic for dataset with outliers, though.
\[p_{ij}=\frac{exp(-|x_i-x_j|^2/2\sigma ^2)}{\sum _{k \ne l}exp(-|x_i-x_j|^2/2\sigma ^2)}\] Now that we defined similarity \(p_{ij}\) in the original space, define similarity \(q_{ij}\) in the embedding space. It’s also defined so that \(\sum _{i,j}q_{ij}=1\). For example, you could use \((|y_i-y_j|^2)^{-1}\) as similarity, but since inverse of something infinitesimally small is close to infinity, we add \((1+|y_i-y_j|^2)^{-1}\) to get stable inverse.
\[q_{ij} = \frac{(1+|y_i-y_j|^2)^{-1}}{\sum _{k \ne l}(1+|y_k-y_l|^2)^{-1}}\]
Now that we defined two similarities in two spaces (original and embedding), t-SNE learns by data points to get the \(y_i\) and \(y_j\) that defines the \(q_{ij}\) that goes close to \(p_{ij}\). It’s like moving around the position of \(y_i\), \(y_j\) while looking at target \(p_{ij}\).
We use gradient descent in t-SNE for learning. It moves around points of \(y\) around with the rate of change defined by gradient:
\[\frac{\partial C}{\partial y_i} = \sum _j (p_{ij}-q_{ij})(y_i-y_j) \frac{1}{1+|y_i-y_j|^2}\]
All this concept and \(+ \alpha\) is nicely explained in this article.
Now that we’ve covered the very basic concepts… let’s get to it.
# Reduce dimensionality
my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10, reduction_method = 'tSNE', verbose = TRUE)
## Remove noise by PCA ...
## Reduce dimension by tSNE ...
reduceDimension function uses both PCA and tSNE.
my_cds <- clusterCells(my_cds, num_clusters=15)
## Distance cutoff calculated to 4.315355
Why is it different from the book? Idk.
head(pData(my_cds))
## barcode Size_Factor num_genes_expressed UMI
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 2394
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 1694
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 4520
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 2788
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 4667
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4440
## Cluster peaks halo delta rho
## AAACCTGAGCATCATC-1 2 FALSE FALSE 0.1649455 113.7337
## AAACCTGAGCTAACTC-1 6 FALSE TRUE 0.4159332 147.4402
## AAACCTGAGCTAGTGG-1 4 FALSE TRUE 0.3108593 130.9082
## AAACCTGCACATTAGC-1 10 FALSE TRUE 0.7702480 123.1802
## AAACCTGCACTGTTAG-1 3 FALSE TRUE 0.4021566 144.6951
## AAACCTGCATAGTAAG-1 3 FALSE TRUE 0.3091744 142.0625
## nearest_higher_density_neighbor
## AAACCTGAGCATCATC-1 NA
## AAACCTGAGCTAACTC-1 NA
## AAACCTGAGCTAGTGG-1 NA
## AAACCTGCACATTAGC-1 NA
## AAACCTGCACTGTTAG-1 NA
## AAACCTGCATAGTAAG-1 NA
my_cluster_dim_10 <- pData(my_cds)$Cluster
plot_cell_clusters(my_cds)
The main interest in single-cell transcriptome analysis is finding novel markers that show different expression patterns according to cell types. Discovering marker genes according to cell’s gene expression level can be done with differentialGeneTest() function. For this job, create a character vector, and if a particular cell belongs to cluster 1, designate ‘yes’, if not, ‘no’. You can see which genes are differentially expressed in cluster 1 apart from all the other cells.
my_vector <- rep('no', nrow(pData(my_cds))) # Repeat 'no'
my_vector[pData(my_cds)$Cluster == 1] <- rep('yes', sum(pData(my_cds)$Cluster == 1))
# take index of my_vector where Cluster value is equal to 1, and repeat that many "yes"s and change them with that vector.
Make a new column under name of test in phenotype data of my_cds and put the values in my_vector in there for each cell (sample).
pData(my_cds)$test <- my_vector
head(pData(my_cds))
## barcode Size_Factor num_genes_expressed UMI
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1 0.5672842 871 2394
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1 0.4014116 806 1694
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1 1.0710628 1316 4520
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1 0.6606467 898 2788
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1 1.1058961 1526 4667
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1 1.0521060 1495 4440
## Cluster peaks halo delta rho
## AAACCTGAGCATCATC-1 2 FALSE FALSE 0.1649455 113.7337
## AAACCTGAGCTAACTC-1 6 FALSE TRUE 0.4159332 147.4402
## AAACCTGAGCTAGTGG-1 4 FALSE TRUE 0.3108593 130.9082
## AAACCTGCACATTAGC-1 10 FALSE TRUE 0.7702480 123.1802
## AAACCTGCACTGTTAG-1 3 FALSE TRUE 0.4021566 144.6951
## AAACCTGCATAGTAAG-1 3 FALSE TRUE 0.3091744 142.0625
## nearest_higher_density_neighbor test
## AAACCTGAGCATCATC-1 NA no
## AAACCTGAGCTAACTC-1 NA no
## AAACCTGAGCTAGTGG-1 NA no
## AAACCTGCACATTAGC-1 NA no
## AAACCTGCACTGTTAG-1 NA no
## AAACCTGCATAGTAAG-1 NA no
length(unsup_clustering_genes$gene_id)
## [1] 4012
# Differential expression of cluster 1
de_cluster_one <- differentialGeneTest(my_cds[unsup_clustering_genes$gene_id,],
fullModelFormulaStr = '~test',
cores = 4)
differentialGeneTest usage:
differentialGeneTest(cds, fullModelFormulaStr="~?????", cores=# (1-4)) https://rdrr.io/bioc/monocle/man/differentialGeneTest.html
So the formula string is what specifies what to analyze the difference according to. And here we have to analyze by the column test (thus ~test as fullModelFormulaStr) and its value as in yes or no.
The function tests each gene for differential expression whether they change as a function of pseudotime or according to other covariates as specified. It’s Monocle’s main differential analysis routine and it accepts CellDataSet and two model formulae as input, which specify generalized lineage models as implemented by the VGAM package. VGAM stands for vector generalized linear and additive models. It’s an implementation of about 6 major classes of statistical regression models. The central algorithm is Fisher scoring and iterative reweighted least squares. At the very heart of this package are data-driven VGLMs that use smoothing. The book “Vector Generalized Linear and Additive Models: With an Implementation in R” (Yee, 2015) gives details of the statistical framework and the package.
dim(de_cluster_one)
## [1] 4012 8
Then we order the differentially expressed genes according to q-value (adjusted p-value that signifies the statistical significance.)
de_cluster_one %>% arrange(qval) %>% head()
## status family pval qval id
## 1 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000163131
## 2 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000204287
## 3 OK negbinomial.size 0.000000e+00 0.000000e+00 ENSG00000196126
## 4 OK negbinomial.size 1.204779e-319 9.864436e-318 ENSG00000145649
## 5 OK negbinomial.size 4.354764e-284 3.176602e-282 ENSG00000138795
## 6 OK negbinomial.size 2.162205e-236 1.188324e-234 ENSG00000182287
## gene_short_name num_cells_expressed use_for_ordering
## 1 CTSS 5547 TRUE
## 2 HLA-DRA 6195 TRUE
## 3 HLA-DRB1 5303 TRUE
## 4 GZMA 2012 TRUE
## 5 LEF1 2519 TRUE
## 6 AP1S2 3697 TRUE
It shows that CTSS (Ensembl gene id: ENSG00000163131) gene is the most significantly differentially expressed gene.
Now we can visualize the gene’s expression level according to the cluster with plot_genes_jitter().
plot_genes_jitter(my_cds['ENSG00000163131',], grouping = "Cluster")
You can see that clusters 3, 6, 13 have CTSS gene differentially expressed (statistically significant) from cluster 1.
You can visualize this cluster on the plot with plot_cell_clusters yet again, but now with clusters 3, 6, 13 vs. others.
pData(my_cds)$my_colour <- pData(my_cds)$Cluster == 3 | pData(my_cds)$Cluster == 6 | pData(my_cds)$Cluster == 13
plot_cell_clusters(my_cds, color_by = "my_colour")
Clustering is a classification process where you take a bunch of similar cells and cluster them together as bundles. But that’s not the end of the story. You can look at the clusters and connect them to each other due to expression pattern similarity, yet again. So the idea is, developing and differentiating cells must change their states quite continuously (but if time interval or resolution is too long, or if a drastic cell differentiation takes place at a rapid pace it will seem the cells are discrete. So you gotta be careful.) and cells are occupants of a vast, complex landscape of possible states rather than precisely defined, cut-out entities of certain types. If you connect the states together it will look like a smooth cylinder in space. (I’m not sure as to what this means. \(\rightarrow\) Refer to this bioarxiv article called “The transcriptome dynamiccs of single cells during the cell cycle”.)
By simplifying these connectivity (the relationship), looking for the central axis, and arranging these cells along that central axis, one can reconstruct the trajectory of cellular differentiation as to how its expression changes (based on the few variables, or genes, that are main components in the reduced dimension). One can think of this concept as time axis projected onto a space axis, so people often call it the “pseudo-time” analysis. There are various pseudo-time analysis methods and it would be advisable to try many different methods because there exists as many optimal ways of tackling this problem as there exists structural diversity.
Example.
The trajectory above shows the trajectory followed by olfactory neurons as they develop in mice, from progenitor cells to precursor cells to immature olfactory neurons to mature ones. Each colored point represents a cell, which are connected into a minimum spanning tree. Minimum Spanning Tree or MST is the core data structure Monocle uses to find the trajectory, shown in black lines. Each cell’s pseudotime value is measured as the distance along the trajectory from its position back to the beginning. (Traverse the line and measure the trajectory distance… this is what you get as the measurement of pseudotime.) Refer to this page at Trapnell lab that developed Monocle. Also, “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells”
Tutorial describes Monocle like this: Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall “trajectory” of gene expression changes, Monocle can place each cell at its proper position in the trajectory
In this practice, we use algorithms that comprise of reducing dimensionality, MST building, ordering cells in pseudotime, and labeling cells to discover the chronological (?) order of gene expression. First of all, the trajectory analysis consists of mainly three stages:
… And you might add, labeling cells.
For the trajectory analysis, I will use only a subset of all cells (clusters 3, 6, 13, designated by pData(my_cds)$my_colour) and genes that are expressed in at least 10 cells.
expressed_genes <- row.names(subset(fData(my_cds), num_cells_expressed >= 10))
# my_coolour is TRUE if a cell belongs to either cluster 3, 6, or 13.
my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]
# 15466 genes & 1904 cells
my_cds_subset
## CellDataSet (storageMode: environment)
## assayData: 15446 features, 1904 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: AAACCTGAGCTAACTC-1 AAACCTGCACTGTTAG-1 ...
## TTTGTCATCGTCTGAA-1 (1904 total)
## varLabels: barcode Size_Factor ... my_colour (12 total)
## varMetadata: labelDescription
## featureData
## featureNames: ENSG00000238009 ENSG00000279457 ... ENSG00000271254
## (15446 total)
## fvarLabels: id gene_short_name num_cells_expressed use_for_ordering
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
So we got 15466 genes (features), and 1904 cells (samples).
We’ll use the recommended approach of ordering based on genes that differ between clusters. First, we’ll perform another subset of genes, keeping only genes expressed in greater than 5% of the cell population.
First, select for only the expressed genes and get their pData
# samples (each cell)
head(colnames(my_cds_subset))
## [1] "AAACCTGAGCTAACTC-1" "AAACCTGCACTGTTAG-1" "AAACCTGCATAGTAAG-1"
## [4] "AAACCTGGTAGAAGGA-1" "AAACCTGGTTTGCATG-1" "AAACCTGTCTCGCTTG-1"
# genes
head(rownames(my_cds_subset))
## [1] "ENSG00000238009" "ENSG00000279457" "ENSG00000228463" "ENSG00000237094"
## [5] "ENSG00000237491" "ENSG00000225880"
# Select for genes that are expressed in more than
my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]
# row = genes that are expressed in at least 10 cells
# column = TRUE for clusters 3, 6, and 13 that have CTSS significant differential expression.
my_cds_subset <- detectGenes(my_cds_subset, min_expr = 0.1)
# select for genes with expression levels greater than 0.1
fData(my_cds_subset)$use_for_ordering <- fData(my_cds_subset)$num_cells_expressed > 0.05*ncol(my_cds_subset)
# Boolean value goes into user_for_ordering to identify which genes are expressed in more than 5% of the total cell population.
table(fData(my_cds_subset)$use_for_ordering)
##
## FALSE TRUE
## 9358 6088
Reduce dimensions
my_cds_subset <- reduceDimension(my_cds_subset,
max_components = 2,
norm_method = 'log',
num_dim = 10,
reduction_method = 'tSNE',
verbose = TRUE)
## Remove noise by PCA ...
## Reduce dimension by tSNE ...
my_cds_subset <- clusterCells(my_cds_subset, verbose = FALSE)
## Distance cutoff calculated to 3.128448
rho_threshold: The threshold of local density (rho) used to select the density peaks
delta_threshold: The threshold of local distance (delta) used to select the density peaks
skip_rho_sigma: A logic flag to determine whether or not you want to skip the calculation of rho / sigma
my_cds_subset <- clusterCells(my_cds_subset,
rho_threshold = 2,
delta_threshold = 10,
skip_rho_sigma = T,
verbose=TRUE)
## Use the user provided rho and delta for assigning the density peaks and clusters
plot_cell_clusters(my_cds_subset)
There are total five clusters, and cluster info is stored in my_cds_subset$Cluster.
Execute DEG like we did before.
head(my_cds_subset$Cluster)
## [1] 2 3 3 3 3 4
## Levels: 1 2 3 4 5
clustering_DEG_genes <- differentialGeneTest(my_cds_subset,
fullModelFormulaStr = '~Cluster',
cores = 4)
# Differential anlysis according to Cluster column (what we got from clusterCells)
dim(clustering_DEG_genes)
## [1] 15446 8
clustering_DEG_genes %>% arrange(qval) %>% head()
## status family pval qval id gene_short_name
## 1 OK negbinomial.size 0 0 ENSG00000163220 S100A9
## 2 OK negbinomial.size 0 0 ENSG00000163221 S100A12
## 3 OK negbinomial.size 0 0 ENSG00000143546 S100A8
## 4 OK negbinomial.size 0 0 ENSG00000203747 FCGR3A
## 5 OK negbinomial.size 0 0 ENSG00000163736 PPBP
## 6 OK negbinomial.size 0 0 ENSG00000090382 LYZ
## num_cells_expressed use_for_ordering
## 1 1893 TRUE
## 2 1593 TRUE
## 3 1879 TRUE
## 4 339 TRUE
## 5 102 TRUE
## 6 1901 TRUE
Arrange the genes in order of most significantly differential expressed (smallest q-value), reduce dimensions, then pseudotime analyze to follow the trajectory of cell development/differentiation.
my_ordering_genes <- row.names(clustering_DEG_genes)[order(clustering_DEG_genes$qval)][1:1000]
# You can also write it as:
# row.names(clustering_DEG_genes[order(clustering_DEG_genes$qval)])[1:1000]
With this list of ordered genes with the most differentially expressed to the 1000th,
my_cds_subset <- setOrderingFilter(my_cds_subset, ordering_genes = my_ordering_genes)
my_cds_subset <- reduceDimension(my_cds_subset, method='DDRTree')
DDRTree is an R implementation of DDRTree algorithm from Qi Mao, Li Wang, et al. (Qi Mao, Li Wang, Steve Goodison, and Yijun Sun. Dimensionality Reduction via Graph Structure Learning. The 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’15), 2015 http://dl.acm.org/citation.cfm?id=2783309. paper in pdf).
Its rdrr page is over here.
my_cds_subset <- orderCells(my_cds_subset)
plot_cell_trajectory(my_cds_subset, color_by = "Cluster")
Q. Why does it look completely different in both R Markdown Notebook and the textbook hardcopy?
As previously mentioned, RNA editing is change in base after transcription and before translation while RNA is floating around. RNA editing can be classified into two categories: substitution-like (by deamination… mostly C-to-U and A-to-I.) and indel. In substitution-like RNA editing, each nucleotide is chemically modified into another base. ADAR (adenosine deaminase acting on RNA) acts on RNA nucleotides and adenosine turns to inosine (A-to-I). This contributes to about 90% of the entire RNA editing cases. RNA editing occurs throughout various RNA regions, from 5’UTR, 3’UTR, non-coding RNA, to coding regions and known to occur frequently in Alu regions. RNA editing affects the function of many transcripts, especially alternative splicing and gene expression. It’s an important molecular mechanism in regulating protein function and is associated with various human disease phenotypes. RNA editing analysis using RNA-seq can be classified into two types:
The first is excavating novel RNA editing phenomena by comparing RNA-Seq and DNA-Seq data of the same sample. VCF files from RNA-seq and DNA-seq are compared side-by-side and the base where chromosome, position, and reference sequence are the same but have different variants is reported as where RDD (RNA-DNA Difference) phenomenon occurs. Called RDDs are RNA editing candidates.
Second is comparing RNA editing position with reference sequence to find RDDs (simply resequencing) and confirming these RDDs by mapping them to known RNA editing positions. Database DARNED (DAtabase of RNa EDiting in humans) provides about 42,000 known RNA editing positions.
After finding RDDs and selecting for RNA editing positions, we gotta annotate. ExpEdit(http://www.caspur.it/ExpEdit/) and RCARE (RNA-Seq Comparison and Annotation for RNA Editing; http://www.snubi.org/software/rcare/) exist for this purpose. ExpEdit only supports hg 18, so it’s not relevant today. RCARE uses hg 19 reference genome and it can both annotate biological functions and visualize the analysis. In this exercise, we’re gonna use HeLa cell line data that originates from cervical cancer to learn how to analyze RNA editing information with RCARE.
Here’s a review article on RNA editing and RNA editing analysis and annotation pipelines by Su Youn Lee and Prof. Ju Han Kim: Bioinformatics Approaches for the Identification and Annotation of RNA Editing Sites. It says there have been many novel RNA editing discovery tools developed with the advent of high-throughput NGS technology, but each pipeline has distinct shortcomings. For example, a lot of false positives have been a problem. It’s in part due to inherent limitation of currently available sequencing technologies and also in part due to incompleteness of bioinformatics analysis methods.
So let’s look at RNA editing pipeline. This picture is from the forementioned review paper.
Many RNA editing sites have been reported to be present in a variety of human diseases. To list a few, there have been reports of RNA editing existent in epilepsy, brain ischemia, depression and brain tumors.3 However, RNA editing seems to be one of the regulatory process to ensure biological diversity in our body.
Apolipoprotein B (human APOB) gene is a representative case of such diverse phenotype of one gene. APOB gene consists of 29 exons and total of 4,564 codons, including the stop codon. This gene is expressed in liver cells (hepatocytes) and gastrointestinal tract (especially intestinal cells: duodenum, small intestine, colon). In liver cells, all 4563 codons are translated into protein. This is ApoB-100, which is the largest of the apoB group of proteins. but in intestinal cells 2153th codon (CAA) turns into UAA by cytidine deaminase (this is the trans-acting tissue-specific RNA editing gene that determines which isoform is ultimately produced) and gets converted into stop codon. This is ApoB48. Therefore, ApoB48 and ApoB100 share a common N-terminal sequence, but ApoB48 lacks ApoB100’s C-terminal LDL receptor binding region; therefore does not bind to LDL receptors. ApoB48 exists exclusively in small intestine chylomicrons.^[[http://www.incodom.kr/RNA_editing]
Here is a list of resources for RNA editing research.
The practice dataset is ENCODE (http://genome.ucsc.edu/ENCODE) data of HeLa cell line’s DNA and RNA VCF (partial).
[soh1@x020 RDD]$ head DNA.vcf
##fileformat=VCFv4.1
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT HeLa-S3
1 89923 . A T 31.8 PASS DP=3;VDB=6.240000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33 GT:PL:GQ 1/1:63,6,0:10
1 134667 . A G 87.3 PASS DP=5;VDB=6.970230e-02;AF1=1;AC1=2;DP4=0,0,4,1;MQ=50;FQ=-42 GT:PL:GQ 1/1:120,15,0:27
1 138875 . G A 21 PASS DP=9;VDB=8.150059e-02;RPB=0.000000e+00;AF1=0.5002;AC1=1;DP4=3,0,3,1;MQ=46;FQ=6.16;PV4=1,1,0.22,1 GT:PL:GQ 0/1:51,0,32:35
1 321061 . C G 26 PASS DP=21;VDB=8.828460e-02;RPB=-1.030420e+00;AF1=0.5;AC1=1;DP4=9,1,6,0;MQ=41;FQ=29;PV4=1,1,0.46,1 GT:PL:GQ 0/1:56,0,91:59
1 564461 . A G 27.8 PASS DP=2;VDB=5.960000e-02;AF1=1;AC1=2;DP4=0,0,2,0;MQ=50;FQ=-33 GT:PL:GQ 1/1:59,6,0:10
1 564467 . A G 23 PASS DP=4;VDB=5.960000e-02;RPB=8.745357e-01;AF1=1;AC1=2;DP4=1,0,2,0;MQ=41;FQ=-30;PV4=1,0.23,1,1 GT:PL:GQ 1/1:53,3,0:6
1 567242 . G C 222 PASS DP=222;VDB=4.677715e-02;RPB=9.763200e+00;AF1=1;AC1=2;DP4=41,0,120,7;MQ=43;FQ=-120;PV4=0.2,4.8e-07,1,1 GT:PL:GQ 1/1:255,93,0:99
1 569717 . T C 125 PASS DP=17;VDB=6.710477e-02;RPB=1.448414e+00;AF1=0.7304;AC1=1;DP4=1,0,4,7;MQ=46;FQ=-25;PV4=0.42,0.14,0.34,1 GT:PL:GQ 0/1:154,0,2:5
[soh1@x020 RDD]$ head RNA.vcf
##fileformat=VCFv4.1
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT HeLa-S3
1 89862 . T G 42 PASS DP=3;VDB=5.980293e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36 GT:PL:GQ 1/1:74,9,0:16
1 90789 . G C 38.5 PASS DP=6;VDB=6.499381e-02;AF1=1;AC1=2;DP4=0,0,3,1;MQ=43;FQ=-39 GT:PL:GQ 1/1:71,12,0:21
1 91581 . G A 23.8 PASS DP=2;VDB=7.040000e-02;AF1=1;AC1=2;DP4=0,0,0,2;MQ=50;FQ=-33 GT:PL:GQ 1/1:55,6,0:10
1 142492 . G A 80.3 PASS DP=6;VDB=9.464066e-02;AF1=1;AC1=2;DP4=0,0,4,1;MQ=50;FQ=-42 GT:PL:GQ 1/1:113,15,0:27
1 320229 . G T 47 PASS DP=3;VDB=3.576708e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36 GT:PL:GQ 1/1:79,9,0:16
1 322852 . T C 44 PASS DP=6;VDB=5.498856e-02;RPB=1.291774e+00;AF1=0.5;AC1=1;DP4=0,2,1,2;MQ=50;FQ=18.1;PV4=1,0.33,1,0.02 GT:PL:GQ 0/1:74,0,45:48
1 564654 . G A 125 PASS DP=21;VDB=2.406943e-02;AF1=1;AC1=2;DP4=0,0,17,0;MQ=49;FQ=-78 GT:PL:GQ 1/1:158,51,0:99
1 566849 . G A 211 PASS DP=396;VDB=6.920000e-06;RPB=-1.075361e+00;AF1=1;AC1=2;DP4=0,1,4,206;MQ=45;FQ=-282;PV4=1,0.45,0.32,0.1 GT:PL:GQ 1/1:244,255,0:99
If you look at both files and compare them, there must be where they are same in chromosome number and position, but have different alt alleles. The following is the python code that does just that:
r_dra = {} # DNA
#1 89862 . T G 42 PASS DP=3;VCF=5.980293e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36 GP:PL:GQ 1/1:74,9,0:16
#0 1 2 3 4 5 6 7 8 9
f=open('DNA.vcf', 'r') # Open DNA.vcf
for line in f.readlines(): # read all lines iteratively
if line.startswith('#'): pass # If it's a header, just pass.
else: # for other lines...
f_sp = line.strip().split('\t')
r_dra[f_sp[0]+"\t"+f_sp[1]] = f_sp[2:] # make "chr num and position" as key and the rest (a list) as val.
f.close()
RDD_site=[] # RDD site list is an empty list.
f1=open('RNA.vcf', 'r')
for f1_line in f1.readlines():
if f1_line.startswith('#'): pass
else:
f1_sp = f1_line.strip().split('\t')
if('\t'.join(f1_sp[0:2])) in r_dra : # if the key "chr num and position" exists in DNA vcf data,
DNA = r_dra['\t'.join(f1_sp[0:2])] # ALT seq data (the rest...)
if f1_sp[4] != DNA[2]: # if RNA alt seq and DNA alt seq are NOT the same!
RDD_site.append('chr'+f1_sp[0]+'\t'+'\t'.join(f1_sp[1:]))
# join all the info as tab-delimited string and append to the RDD list.
f1.close()
re=open('RDD.vcf', 'w') # Open and write a new file (in 'w' mode)
re.write('##fileformat=VCFv4.1'+'\n') # The first line describes file format, and then newline
re.write('#CHR\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHeLa-RDD\n') # Header: column info.
re.write('\n'.join(RDD_site)) # Add all lines in RDD list joined by newline
re.close()
Let’s annotate the RDD.vcf file that we got from above process (lines of RNA.vcf that have different ALT seq from DNA.vcf ALT seq although the chromosome number and position are identical). Go to RCARE, and upload the RDD.vcf file we made from comparing RNA and DNA vcf files. If you don’t have the vcf, download the preprocess utilities shown on top of Annotation tab.
This means template having multiple segements in sequencing, SEQ of the next segment in the template being reverse complemented, the first segment in the template, and secondary alignment. Secondary alignment means that it’s got multiple places that the read aligns to reasonably well. This topic on biostar talks about secondary alignments in BAM. It’s different from duplicate (0x400 in FLAG) - where as duplicates are created by abnormal amplification of specific fragments during PCR of library preparation process due to technical issues. This is a nice article about how duplciates arise.↩︎